
2025-05-19-20-06

Interpretable Risk Mitigation in LLM Agent Systems

Abstract

arXiv:2505.10670v1 Announce Type: new Abstract: Autonomous agents powered by large language models (LLMs) enable novel use cases in domains where responsible action is increasingly important. Yet the inherent unpredictability of LLMs raises safety concerns about agent reliability. In this work, we explore agent behaviour in a toy, game-theoretic environment based on a variation of the Iterated Prisoner's Dilemma. We introduce a strategy-modification method, independent of both the game and the prompt, that steers the residual stream with interpretable features extracted from a sparse autoencoder latent space. Steering with the good-faith negotiation feature lowers the average defection probability by 28 percentage points. We also identify feasible steering ranges for several open-source LLM agents. Finally, we hypothesise that game-theoretic evaluation of LLM agents, combined with representation-steering alignment, can generalise to real-world applications on end-user devices and embodied platforms.

摘要

由大语言模型(LLMs)驱动的自主智能体在责任行为日益重要的领域展现出新颖的应用前景。然而LLMs固有的不可预测性引发了关于智能体可靠性的安全担忧。本研究基于迭代囚徒困境的变体,在一个玩具博弈论环境中探索智能体行为。我们提出了一种独立于游戏规则和提示词的策略修改方法——通过利用稀疏自编码器潜在空间中提取的可解释特征来引导残差流。实验表明,采用诚信协商特征进行引导时,平均背叛概率降低了28个百分点。同时,我们确定了多个开源LLM智能体的可行引导范围。最后,我们提出假设:结合表征引导对齐的博弈论评估方法,可推广至终端用户设备和实体化平台的实际应用场景。
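
The steering step described in this abstract can be pictured as adding a scaled feature direction to a residual-stream activation. The following is only a minimal sketch under stated assumptions: the function name `steer`, the toy vectors, and the coefficient `alpha` are hypothetical, and the abstract does not specify the authors' exact procedure or the values they use.

```python
import math

def steer(residual, direction, alpha):
    """Return residual + alpha * unit(direction).

    residual  -- hypothetical residual-stream activation (list of floats)
    direction -- hypothetical sparse-autoencoder decoder direction for one feature
    alpha     -- steering strength (assumed scalar; the abstract only reports
                 that feasible ranges exist, not their values)
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    # Normalise the direction so alpha alone controls the size of the shift.
    return [r + alpha * d / norm for r, d in zip(residual, direction)]

# Toy three-dimensional example; real activations have thousands of dimensions.
h = [0.5, -1.0, 2.0]
d = [0.0, 3.0, 4.0]
steered = steer(h, d, alpha=5.0)
```

Normalising the direction is a design choice made here so that `alpha` is comparable across features; an actual implementation might instead scale by the feature's typical activation.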


Evaluations at Work: Measuring the Capabilities of GenAI in Use

Abstract

arXiv:2505.10742v1 Announce Type: new Abstract: Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration. We present an evaluation framework that decomposes real-world tasks into interdependent subtasks, letting us track both LLM performance and users' strategies across a dialogue. Complementing this framework, we develop a suite of metrics, including a composite usage derived from semantic similarity, word overlap, and numerical matches; structural coherence; intra-turn diversity; and a novel measure of the "information frontier" reflecting the alignment between AI outputs and users' working knowledge. We demonstrate our methodology in a financial valuation task that mirrors real-world complexity. Our empirical findings reveal that while greater integration of LLM-generated content generally enhances output quality, its benefits are moderated by factors such as response incoherence, excessive subtask diversity, and the distance of provided information from users' existing knowledge. These results suggest that proactive dialogue strategies designed to inject novelty may inadvertently undermine task performance. Our work thus advances a more holistic evaluation of human-AI collaboration, offering both a robust methodological framework and actionable insights for developing more effective AI-augmented work processes.

摘要

当前的人工智能基准测试未能捕捉人机协作中混乱、多轮交互的本质。我们提出一个评估框架,将现实任务分解为相互依赖的子任务,从而能够追踪对话过程中大语言模型的性能表现与用户策略。作为该框架的补充,我们开发了一套评估指标,包括基于语义相似度、词汇重叠率和数值匹配的综合使用度、结构连贯性、轮内多样性,以及反映AI输出与用户既有知识对齐程度的创新性"信息前沿"指标。我们通过模拟真实复杂度的金融估值任务验证了这一方法论。实证研究表明:虽然更深度整合大语言模型生成内容通常能提升输出质量,但其效益会受到响应不连贯、子任务多样性过高、所提供信息与用户既有知识距离过远等因素的调节。这些发现表明,旨在注入新颖性的主动对话策略可能无意中损害任务表现。本研究由此推进了对人机协作更全面的评估,既提供了严谨的方法论框架,也为开发更有效的人工智能增强工作流程给出了可操作的见解。


Embodied AI in Machine Learning -- is it Really Embodied?

Abstract

arXiv:2505.10705v1 Announce Type: new Abstract: Embodied Artificial Intelligence (Embodied AI) is gaining momentum in the machine learning communities with the goal of leveraging current progress in AI (deep learning, transformers, large language and visual-language models) to empower robots. In this chapter we put this work in the context of "Good Old-Fashioned Artificial Intelligence" (GOFAI) (Haugeland, 1989) and the behavior-based or embodied alternatives (R. A. Brooks 1991; Pfeifer and Scheier 2001). We claim that the AI-powered robots are only weakly embodied and inherit some of the problems of GOFAI. Moreover, we review and critically discuss the possibility of cross-embodiment learning (Padalkar et al. 2024). We identify fundamental roadblocks and propose directions on how to make progress.

摘要

具身人工智能(Embodied AI)正在机器学习领域获得持续关注,其目标是通过利用当前人工智能领域(深度学习、Transformer架构、大语言模型及视觉-语言模型)的进展来增强机器人能力。本章将这项工作置于"经典人工智能"(GOFAI)(Haugeland, 1989)与基于行为或具身的替代方案(R. A. Brooks 1991; Pfeifer和Scheier 2001)的理论框架中进行探讨。我们认为当前AI驱动的机器人仅具备弱具身性,并继承了经典人工智能的某些固有问题。此外,我们系统评述并批判性讨论了跨具身学习(Padalkar等, 2024)的可能性。研究揭示了根本性障碍,并就突破方向提出了建议。


PoE-World: Compositional World Modeling with Products of Programmatic Experts

Abstract

arXiv:2505.10819v1 Announce Type: new Abstract: Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.

摘要

学习世界运作机制是构建能够适应复杂环境的人工智能代理的核心任务。传统基于深度学习的世界模型需要大量训练数据,且无法通过稀疏观察灵活更新知识。近期利用大型语言模型(LLM)进行程序合成的研究进展提供了一种替代方案,该方法可学习以源代码表示的世界模型,实现少量数据下的强泛化能力。目前,程序结构化世界模型的应用仍局限于自然语言和网格世界领域。我们提出一种新颖的程序合成方法,通过将世界模型表示为LLM合成的程序专家指数加权乘积(PoE-World),有效建模复杂的非网格世界领域。研究表明,该方法仅需少量观察即可学习复杂的随机世界模型。我们通过将习得的世界模型嵌入基于模型的规划代理进行评估,在Atari的《Pong》和《蒙特祖马的复仇》游戏中展现出高效性能及对未见过关卡的泛化能力。代码已开源,学习到的世界模型及代理游戏视频详见https://topwasu.github.io/poe-world。
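
To make the phrase "exponentially-weighted product of programmatic experts" concrete, the sketch below combines discrete expert distributions as p(x) ∝ ∏ p_i(x)^{w_i}. It rests on assumptions: the experts here are hand-written dictionaries standing in for LLM-synthesized programs, and the function name, outcomes, and weights are hypothetical rather than taken from the paper.

```python
import math

def product_of_experts(expert_probs, weights):
    """Combine discrete distributions as p(x) proportional to prod_i p_i(x) ** w_i.

    expert_probs -- list of dicts mapping each outcome to a strictly positive probability
    weights      -- one non-negative weight per expert (assumed to be learned elsewhere)
    """
    outcomes = list(expert_probs[0].keys())
    # Work in log space so that many experts do not underflow.
    log_scores = {
        x: sum(w * math.log(p[x]) for p, w in zip(expert_probs, weights))
        for x in outcomes
    }
    shift = max(log_scores.values())
    unnormalised = {x: math.exp(s - shift) for x, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {x: v / total for x, v in unnormalised.items()}

# Two hypothetical experts predicting the next vertical move of a ball.
confident_expert = {"up": 0.9, "down": 0.1}
indifferent_expert = {"up": 0.5, "down": 0.5}
combined = product_of_experts([confident_expert, indifferent_expert], [1.0, 2.0])
```

An expert that is indifferent between outcomes leaves the combined prediction unchanged regardless of its weight, which is one reason a product form composes well when each expert only constrains part of the state.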


Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models

Abstract

arXiv:2505.10844v1 Announce Type: new Abstract: Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.

摘要

准确度仍是评估人工智能系统的标准指标,但其对模型求解过程的揭示有限。本研究提出一种基于叙事式长文本谜题的基准测试,旨在深入探究模型采用的推理策略类型。谜题特别适合此目标,因其可通过多种方法求解——既可利用创造性洞察实现简短解答,亦可采用更耗时的暴力求解法。我们通过多层次推理研究大型语言模型(LLMs),不仅关注答案正确性,更着重分析解决方案的质量与创造性。研究涵盖推理过程的多个维度:(1) 将谜题语义解析为精确的数学竞赛格式;(2) 基于数学形式生成解决方案;(3) 根据标准答案自我修正解;(4) 生成分步解决方案框架;(5) 利用提示信息。研究发现LLMs在多案例中能提出具有创造性和洞察力的解法,表明其已具备以创新方式解决新问题的部分能力。然而,当存在更高效创新解法时,模型仍存在依赖暴力求解的情况,这揭示了LLMs推理能力有待改进的方向。


Vaiage: A Multi-Agent Solution to Personalized Travel Planning

Abstract

arXiv:2505.10922v1 Announce Type: new Abstract: Planning trips is a cognitively intensive task involving conflicting user preferences, dynamic external information, and multi-step temporal-spatial optimization. Traditional platforms often fall short: they provide static results, lack contextual adaptation, and fail to support real-time interaction or intent refinement. Our approach, Vaiage, addresses these challenges through a graph-structured multi-agent framework built around large language models (LLMs) that serve as both goal-conditioned recommenders and sequential planners. LLMs infer user intent, suggest personalized destinations and activities, and synthesize itineraries that align with contextual constraints such as budget, timing, group size, and weather. Through natural language interaction, structured tool use, and map-based feedback loops, Vaiage enables adaptive, explainable, and end-to-end travel planning grounded in both symbolic reasoning and conversational understanding. To evaluate Vaiage, we conducted human-in-the-loop experiments using rubric-based GPT-4 assessments and qualitative feedback. The full system achieved an average score of 8.5 out of 10, outperforming the no-strategy (7.2) and no-external-API (6.8) variants, particularly in feasibility. Qualitative analysis indicated that agent coordination, especially the Strategy and Information Agents, significantly improved itinerary quality by optimizing time use and integrating real-time context. These results demonstrate the effectiveness of combining LLM reasoning with symbolic agent coordination in open-ended, real-world planning tasks.

摘要

规划旅行是一项认知密集型任务,涉及用户偏好的冲突、动态外部信息以及多步骤时空优化。传统平台通常存在不足——它们提供静态结果、缺乏情境适应性,且不支持实时交互或意图细化。我们的解决方案Vaiage通过基于大语言模型(LLMs)构建的图结构多智能体框架应对这些挑战,该框架兼具目标条件推荐器和序列规划器的功能。大语言模型能够推断用户意图,推荐个性化目的地和活动,并综合生成符合预算、时间安排、团队规模和天气等情境约束的行程方案。通过自然语言交互、结构化工具使用和基于地图的反馈循环,Vaiage实现了植根于符号推理与会话理解的自适应、可解释、端到端旅行规划。为评估Vaiage,我们采用基于量规的GPT-4评估和定性反馈进行了人在环实验。完整系统平均得分为8.5分(满分10分),优于无策略版本(7.2分)和无外部API版本(6.8分),尤其在可行性方面表现突出。定性分析表明,智能体协调——特别是策略智能体与信息智能体——通过优化时间利用和整合实时情境,显著提升了行程质量。这些结果验证了在开放式现实世界规划任务中,将大语言模型推理与符号化智能体协调相结合的有效性。


Code-Driven Planning in Grid Worlds with Large Language Models

Abstract

arXiv:2505.10749v1 Announce Type: new Abstract: We propose an iterative programmatic planning (IPP) framework for solving grid-based tasks by synthesizing interpretable agent policies expressed in code using large language models (LLMs). Instead of relying on traditional search or reinforcement learning, our approach uses code generation as policy synthesis, where the LLM outputs executable programs that map environment states to action sequences. Our proposed architecture incorporates several prompting strategies, including direct code generation, pseudocode-conditioned refinement, and curriculum-based prompting, but also includes an iterative refinement mechanism that updates code based on task performance feedback. We evaluate our approach using six leading LLMs and two challenging grid-based benchmarks (GRASP and MiniGrid). Our IPP framework demonstrates improvements over direct code generation ranging from 10% to as much as 10x across five of the six models and establishes a new state-of-the-art result for GRASP. IPP is found to significantly outperform direct elicitation of a solution from GPT-o3-mini (by 63% on MiniGrid to 116% on GRASP), demonstrating the viability of the overall approach. Computational costs of all code generation approaches are similar. While code generation has a higher initial prompting cost compared to direct solution elicitation ($0.08 per task vs. $0.002 per instance for GPT-o3-mini), the code can be reused for any number of instances, making the amortized cost significantly lower (by 400x on GPT-o3-mini across the complete GRASP benchmark).

摘要

我们提出了一种迭代式程序化规划(IPP)框架,通过使用大型语言模型(LLMs)合成以代码形式表达的可解释智能体策略,来解决基于网格的任务。与传统搜索或强化学习方法不同,我们的方法将代码生成作为策略合成手段,由LLM输出可执行程序,将环境状态映射为动作序列。该架构整合了多种提示策略,包括直接代码生成、伪代码条件细化以及基于课程学习的提示方法,并引入了迭代优化机制,可根据任务性能反馈更新代码。我们在六个主流LLM模型和两个具有挑战性的网格基准测试(GRASP和MiniGrid)上评估了该方法。实验表明,IPP框架在六个模型中的五个上实现了10%至10倍的性能提升,并在GRASP测试中创造了新的最优结果。相较于GPT-o3-mini直接生成解决方案的方式,IPP表现出显著优势(MiniGrid提升63%,GRASP提升116%),验证了整体方法的可行性。所有代码生成方法的计算成本相近。虽然代码生成的初始提示成本高于直接解决方案生成(GPT-o3-mini每个任务0.08美元 vs 每个实例0.002美元),但生成的代码可无限次复用,使得平摊成本显著降低(在完整GRASP基准测试中GPT-o3-mini成本降低400倍)。
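
The amortization argument in this abstract is simple arithmetic, sketched below. The two prices come from the abstract; the function names and the instance count of 1,000 are hypothetical, and the model assumes one generated program is reused unchanged across all instances of a task.

```python
CODE_GEN_COST_PER_TASK = 0.08     # USD per task, quoted in the abstract
DIRECT_COST_PER_INSTANCE = 0.002  # USD per instance, quoted in the abstract

def break_even_instances(task_cost=CODE_GEN_COST_PER_TASK,
                         instance_cost=DIRECT_COST_PER_INSTANCE):
    """Number of instances at which one-off code generation costs the same
    as eliciting a direct solution for every instance."""
    return task_cost / instance_cost

def direct_to_code_cost_ratio(n_instances,
                              task_cost=CODE_GEN_COST_PER_TASK,
                              instance_cost=DIRECT_COST_PER_INSTANCE):
    """How many times more expensive direct elicitation is than reusing one
    generated program across n_instances (simplified: no refinement cost)."""
    return (instance_cost * n_instances) / task_cost

break_even = break_even_instances()
ratio_at_1000 = direct_to_code_cost_ratio(1000)  # 1000 is an illustrative count
print(f"break-even at about {break_even:.0f} instances")
print(f"direct elicitation costs about {ratio_at_1000:.0f}x more at 1000 instances")
```

Under this simplified model the ratio grows linearly with the number of instances that share a program, which is the mechanism behind the amortized saving the abstract reports.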


Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory

Abstract

arXiv:2505.10981v1 Announce Type: new Abstract: Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs × 8 prompting strategies × 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a method according to probability theory to quickly and accurately predict the scaling performance and select the best strategy under large sampling times without extra resource-intensive inference in practice. It can serve as the test-time scaling law for majority voting. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance.

摘要

近期,大型语言模型(LLM)的测试时计算规模扩展问题引发了广泛关注。然而,对于不同推理提示策略在规模扩展中的表现,现有研究仍较为有限。本文聚焦于一种标准且现实的规模扩展场景——多数投票机制,系统性地开展了6种LLM×8种提示策略×6个基准测试的实验。实验结果一致表明:随着采样次数和计算开销的增加,初始性能优越的复杂提示策略会逐渐被简单的思维链(Chain-of-Thought)策略反超。我们对此现象进行了分析并给出理论证明。此外,基于概率论提出了一种方法,可在无需额外资源密集型推理的情况下,快速准确地预测规模扩展性能,并选择大采样次数下的最优策略。该方法可作为多数投票机制下的测试时规模扩展定律。进一步地,我们根据理论分析提出两种显著提升规模扩展性能的优化方案。本研究有望推动学界重新审视复杂提示策略的作用,释放简单提示策略的潜力,并为提升测试时规模扩展性能提供新思路。
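
A toy calculation can show how a strategy with better single-sample accuracy might fall behind under majority voting. This is an illustrative sketch, not the paper's proof or prediction method: it assumes only two candidate answers per question, independent samples, and made-up per-question probabilities for a hypothetical "simple" and "complicated" strategy.

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability that the correct answer wins a majority vote over n
    independent samples, with two candidate answers and n odd."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def expected_accuracy(per_question_p, n):
    """Average majority-vote accuracy over a set of questions."""
    return sum(majority_correct_prob(p, n) for p in per_question_p) / len(per_question_p)

# Hypothetical per-question probabilities of sampling the correct answer.
simple_strategy = [0.6] * 10                    # modest but above 0.5 everywhere
complicated_strategy = [0.95] * 7 + [0.3] * 3   # strong on most, below 0.5 on some

for n in (1, 11, 101):
    print(n,
          round(expected_accuracy(simple_strategy, n), 3),
          round(expected_accuracy(complicated_strategy, n), 3))
```

In this toy setting, voting pushes every question toward 1 when its per-sample probability exceeds 0.5 and toward 0 when it is below, so what matters at large sample counts is the share of questions above that threshold rather than the average single-sample accuracy.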


LLM-Enhanced Symbolic Control for Safety-Critical Applications

Abstract

arXiv:2505.11077v1 Announce Type: new Abstract: Motivated by Smart Manufacturing and Industry 4.0, we introduce a framework for synthesizing Abstraction-Based Controller Design (ABCD) for reach-avoid problems from Natural Language (NL) specifications using Large Language Models (LLMs). A Code Agent interprets an NL description of the control problem and translates it into a formal language interpretable by state-of-the-art symbolic control software, while a Checker Agent verifies the correctness of the generated code and enhances safety by identifying specification mismatches. Evaluations show that the system handles linguistic variability and improves robustness over direct planning with LLMs. The proposed approach lowers the barrier to formal control synthesis by enabling intuitive, NL-based task definition while maintaining safety guarantees through automated validation.

摘要

受智能制造与工业4.0的驱动,本研究提出了一种基于大语言模型的自然语言规范框架,用于解决可达-规避问题的抽象化控制器设计综合。该系统通过代码代理器解析控制问题的自然语言描述,并将其转换为可被前沿符号控制软件识别的形式化语言;同时校验代理器负责验证生成代码的正确性,并通过识别规范失配来增强安全性。评估表明,该系统能有效处理语言变异性,相较于直接使用大语言模型进行规划具有更强的鲁棒性。所提出的方法通过支持直观的自然语言任务定义降低了形式化控制综合的门槛,同时通过自动化验证机制保持了安全保证。


RAGSynth: Synthetic Data for Robust and Faithful RAG Component Optimization

Abstract

arXiv:2505.10989v1 Announce Type: new Abstract: RAG can enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, are built upon two core components: the retriever, which should robustly select relevant documents across complex queries, and the generator, which should faithfully synthesize responses. However, existing retrievers rely heavily on public knowledge and struggle with queries of varying logical complexity and clue completeness, while generators frequently face fidelity problems. In this work, we introduce RAGSynth, a framework that includes data construction modeling and a corresponding synthetic data generation implementation, designed to optimize retriever robustness and generator fidelity. Additionally, we present SynthBench, a benchmark encompassing 8 domain-specific documents across 4 domains, featuring diverse query complexities, clue completeness, and fine-grained citation granularity. Leveraging RAGSynth, we generate a large-scale synthetic dataset, including single- and multi-hop queries. Extensive experiments demonstrate that the synthetic data significantly improves the robustness of the retrievers and the fidelity of the generators. Additional evaluations confirm that RAGSynth can also generalize well across different domains. By integrating the optimized retrievers into various RAG paradigms, we consistently observe enhanced RAG system performance. We have open-sourced the implementation on https://github.com/EachSheep/RAGSynth.

摘要

RAG(检索增强生成)能够提升大语言模型在知识密集型任务中的表现。现有多种RAG范式(包括基础型、规划型和迭代型)均基于两个核心组件:检索器(需在复杂查询中稳健选择相关文档)和生成器(需忠实合成响应)。然而当前检索器过度依赖公共知识,难以应对不同逻辑复杂度与线索完整度的查询,而生成器则频繁面临保真度问题。本研究提出RAGSynth框架,包含数据构建建模与对应合成数据生成实现,旨在优化检索器鲁棒性与生成器保真度。我们同步推出SynthBench基准测试集,涵盖4个领域的8份专业文档,具有多样化查询复杂度、线索完整度及细粒度引用层级。基于RAGSynth生成的大规模合成数据集(含单跳与多跳查询)实验表明,合成数据显著提升了检索器的鲁棒性与生成器的保真度。额外评估证实RAGSynth具备良好的跨领域泛化能力。将优化后的检索器集成至各类RAG范式时,系统性能均获得持续提升。项目代码已开源:https://github.com/EachSheep/RAGSynth。


GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

Abstract

arXiv:2505.11049v1 Announce Type: new Abstract: To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/

摘要

为提升视觉语言模型(VLM)的安全性,本文提出了一种新型基于推理的VLM防护模型GuardReasoner-VL。其核心思想是通过在线强化学习激励防护模型在做出审核决策前进行审慎推理。首先,我们构建了包含12.3万样本和63.1万推理步骤的多模态推理语料库GuardReasoner-VLTrain,涵盖文本、图像及图文混合输入。基于该语料库,我们通过监督微调冷启动模型的推理能力,并进一步利用在线强化学习增强审核相关的推理能力。具体而言,为提升样本多样性和难度,我们采用拒绝采样策略并结合提出的安全感知数据拼接方法进行数据增强。此外,通过动态剪裁参数设计,在训练早期鼓励探索而后期侧重利用。为平衡性能与标记效率,我们设计了融合准确率、格式合规性和标记成本的长度感知安全奖励机制。大量实验验证了模型的优越性,其F1分数平均超越次优模型19.27%。我们在https://github.com/yueliu1999/GuardReasoner-VL/开源了GuardReasoner-VL的数据、代码及模型(3B/7B版本)。


MPS-Prover: Advancing Stepwise Theorem Proving by Multi-Perspective Search and Data Curation

Abstract

arXiv:2505.10962v1 Announce Type: new Abstract: Automated Theorem Proving (ATP) in formal languages remains a formidable challenge in AI, demanding rigorous logical deduction and navigating vast search spaces. While large language models (LLMs) have shown promising performance, existing stepwise provers often suffer from biased search guidance, leading to inefficiencies and suboptimal proof strategies. This paper introduces the Multi-Perspective Search Prover (MPS-Prover), a novel stepwise ATP system designed to overcome these limitations. MPS-Prover incorporates two key innovations: a highly effective post-training data curation strategy that prunes approximately 40% of redundant training data without sacrificing performance, and a multi-perspective tree search mechanism. This search integrates a learned critic model with strategically designed heuristic rules to diversify tactic selection, prevent getting trapped in unproductive states, and enhance search robustness. Extensive evaluations demonstrate that MPS-Prover achieves state-of-the-art performance on multiple challenging benchmarks, including miniF2F and ProofNet, outperforming prior 7B parameter models. Furthermore, our analyses reveal that MPS-Prover generates significantly shorter and more diverse proofs compared to existing stepwise and whole-proof methods, highlighting its efficiency and efficacy. Our work advances the capabilities of LLM-based formal reasoning and offers a robust framework and a comprehensive analysis for developing more powerful theorem provers.

摘要

形式化语言中的自动定理证明(ATP)始终是人工智能领域的一项艰巨挑战,其要求严格的逻辑推演能力并需应对庞大的搜索空间。尽管大语言模型(LLM)已展现出优异性能,现有逐步式证明器常因存在搜索导向偏差而导致效率低下与证明策略欠优。本文提出多视角搜索证明器(MPS-Prover),这一新型逐步式ATP系统旨在突破这些局限。MPS-Prover包含两项关键创新:其一为高效的后训练数据优化策略,可在保持性能前提下剔除约40%冗余训练数据;其二为多视角树搜索机制,该机制通过将学习型评判模型与策略性设计的启发式规则相结合,实现战术选择多样化、避免陷入无效状态并增强搜索鲁棒性。大量实验表明,MPS-Prover在miniF2F和ProofNet等多个高难度基准测试中达到最先进性能,优于此前70亿参数模型。进一步分析显示,相较于现有逐步式与整体式证明方法,MPS-Prover生成的证明过程显著更短且更具多样性,充分体现其高效性与优越性。本研究推动了基于LLM的形式推理能力发展,并为开发更强大的定理证明器提供了稳健框架与系统性分析。


Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

Abstract

arXiv:2505.11063v1 Announce Type: new Abstract: LLM-based autonomous agents possess capabilities such as reasoning, tool invocation, and environment interaction, enabling the execution of complex multi-step tasks. The internal reasoning process, i.e., thought, of behavioral trajectory significantly influences tool usage and subsequent actions but can introduce potential risks. Even minor deviations in the agent's thought may trigger cascading effects leading to irreversible safety incidents. To address the safety alignment challenges in long-horizon behavioral trajectories, we propose Thought-Aligner, a plug-in dynamic thought correction module. Utilizing a lightweight and resource-efficient model, Thought-Aligner corrects each high-risk thought on the fly before each action execution. The corrected thought is then reintroduced to the agent, ensuring safer subsequent decisions and tool interactions. Importantly, Thought-Aligner modifies only the reasoning phase without altering the underlying agent framework, making it easy to deploy and widely applicable to various agent frameworks. To train the Thought-Aligner model, we construct an instruction dataset across ten representative scenarios and simulate ReAct execution trajectories, generating 5,000 diverse instructions and more than 11,400 safe and unsafe thought pairs. The model is fine-tuned using contrastive learning techniques. Experiments across three agent safety benchmarks involving 12 different LLMs demonstrate that Thought-Aligner raises agent behavioral safety from approximately 50% in the unprotected setting to 90% on average. Additionally, Thought-Aligner maintains response latency below 100ms with minimal resource usage, demonstrating its capability for efficient deployment, broad applicability, and timely responsiveness. This method thus provides a practical dynamic safety solution for the LLM-based agents.

摘要

基于大语言模型(LLM)的自主智能体具备推理、工具调用与环境交互等能力,可执行复杂的多步骤任务。行为轨迹中的内部推理过程(即思维)会显著影响工具使用与后续行动,但也可能引入潜在风险。即使智能体思维出现微小偏差,也可能引发连锁反应导致不可逆的安全事故。针对长周期行为轨迹中的安全对齐挑战,本研究提出Thought-Aligner——一种插件式动态思维校正模块。该模块采用轻量级、低资源消耗的模型,在每项行动执行前实时修正高风险思维,并将校正后的思维重新注入智能体,从而确保后续决策与工具交互的安全性。值得注意的是,Thought-Aligner仅修改推理阶段而不改变底层智能体框架,使其易于部署并广泛适用于各类智能体框架。为训练Thought-Aligner模型,我们构建了涵盖十种典型场景的指令数据集,模拟ReAct执行轨迹,生成5,000条多样化指令及超过11,400组安全/不安全思维对,并采用对比学习技术进行模型微调。在包含12种不同LLM的三个智能体安全基准测试中,实验表明Thought-Aligner能将无防护状态下约50%的行为安全率提升至平均90%。此外,该模块在极低资源消耗下保持响应延迟低于100毫秒,展现出高效部署、广泛适用和即时响应的能力。该方法为基于LLM的智能体提供了实用的动态安全解决方案。
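
The control flow described in this abstract, where each thought is corrected before the corresponding action runs and the corrected thought is fed back to the agent, can be sketched as a small loop. Everything here is hypothetical scaffolding: the function names, the toy agent, and the string-matching "corrector" stand in for the LLM agent and the fine-tuned Thought-Aligner model, whose interfaces the abstract does not describe.

```python
from typing import Callable, List, Tuple

def run_with_thought_correction(
    propose_thought: Callable[[List[str]], str],
    correct_thought: Callable[[str], str],
    act: Callable[[str], str],
    max_steps: int,
) -> List[Tuple[str, str, str]]:
    """Run a ReAct-style loop in which every thought passes through a
    corrector before the action is executed."""
    history: List[str] = []
    trace: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        raw = propose_thought(history)
        safe = correct_thought(raw)          # correction happens before acting
        observation = act(safe)
        history.extend([safe, observation])  # the agent only sees the corrected thought
        trace.append((raw, safe, observation))
    return trace

# Toy stand-ins for the agent, the corrector, and the environment.
def toy_propose(history: List[str]) -> str:
    return "delete every file in the workspace" if not history else "report results"

def toy_correct(thought: str) -> str:
    if "delete every file" in thought:
        return "ask the user to confirm before deleting anything"
    return thought

def toy_act(thought: str) -> str:
    return f"executed: {thought}"

trace = run_with_thought_correction(toy_propose, toy_correct, toy_act, max_steps=2)
```

Because the corrector only wraps the thought step, the agent and environment functions are left untouched, which mirrors the plug-in property the abstract emphasises.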


Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking

Abstract

arXiv:2505.11065v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated notable capabilities across financial tasks, including financial report summarization, earnings call transcript analysis, and asset classification. However, their real-world effectiveness in managing complex fund investment remains inadequately assessed. A fundamental limitation of existing benchmarks for evaluating LLM-driven trading strategies is their reliance on historical back-testing, inadvertently enabling LLMs to "time travel" by leveraging future information embedded in their training corpora, thus resulting in possible information leakage and overly optimistic performance estimates. To address this issue, we introduce DeepFund, a live fund benchmark tool designed to rigorously evaluate LLMs in real-time market conditions. Utilizing a multi-agent architecture, DeepFund connects directly with real-time stock market data (specifically, data published after each model's pretraining cutoff) to ensure fair and leakage-free evaluations. Empirical tests on nine flagship LLMs from leading global institutions across multiple investment dimensions, including ticker-level analysis, investment decision-making, portfolio management, and risk control, reveal significant practical challenges. Notably, even cutting-edge models such as DeepSeek-V3 and Claude-3.7-Sonnet incur net trading losses within DeepFund's real-time evaluation environment, underscoring the present limitations of LLMs for active fund management. Our code is available at https://github.com/HKUSTDial/DeepFund.

摘要

大型语言模型(LLMs)在金融任务中展现出显著能力,包括财务报告摘要、财报电话会议记录分析和资产分类等。然而,其在管理复杂基金投资中的实际有效性尚未得到充分评估。现有评估LLM驱动交易策略的基准存在根本性局限——依赖历史回测方法,这无意中使LLMs能够"时间穿越":利用训练语料中隐含的未来信息,从而导致潜在的信息泄露和过于乐观的性能预估。为解决该问题,我们推出DeepFund实时基金基准工具,旨在真实市场环境下严格评估LLMs。通过多智能体架构,DeepFund直接对接实时股市数据(特别采用各模型预训练截止日期后发布的数据),确保公平且无信息泄露的评估。针对全球顶尖机构的九款旗舰LLMs进行的实证测试(涵盖个股分析、投资决策、组合管理和风险控制等多维度)揭示了重大实践挑战。值得注意的是,即便是DeepSeek-V3和Claude-3.7-Sonnet等前沿模型,在DeepFund实时评估环境中也出现净交易亏损,这凸显了LLMs在主动型基金管理中的当前局限性。代码已开源:https://github.com/HKUSTDial/DeepFund。


Abstract

arXiv:2505.11122v1 Announce Type: new Abstract: Alpha factor mining is pivotal in quantitative investment for identifying predictive signals from complex financial data. While traditional formulaic alpha mining relies on human expertise, contemporary automated methods, such as those based on genetic programming or reinforcement learning, often suffer from search inefficiency or yield poorly interpretable alpha factors. This paper introduces a novel framework that integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to overcome these limitations. Our approach leverages the LLM's instruction-following and reasoning capability to iteratively generate and refine symbolic alpha formulas within an MCTS-driven exploration. A key innovation is the guidance of MCTS exploration by rich, quantitative feedback from financial backtesting of each candidate factor, enabling efficient navigation of the vast search space. Furthermore, a frequent subtree avoidance mechanism is introduced to bolster search efficiency and alpha factor performance. Experimental results on real-world stock market data demonstrate that our LLM-based framework outperforms existing methods by mining alphas with superior predictive accuracy, trading performance, and improved interpretability, while offering a more efficient solution for formulaic alpha mining.

摘要

阿尔法因子挖掘在量化投资中对于从复杂金融数据中识别预测信号至关重要。传统公式化阿尔法挖掘依赖人工经验,而当代自动化方法(如基于遗传编程或强化学习的方法)常面临搜索效率低下或生成可解释性差的阿尔法因子等问题。本文提出一种创新框架,通过整合大语言模型(LLMs)与蒙特卡洛树搜索(MCTS)来克服这些局限。该方法利用LLM的指令遵循与推理能力,在MCTS驱动的探索中迭代生成并优化符号化阿尔法公式。关键创新在于通过候选因子金融回测提供的量化反馈来引导MCTS探索,从而实现对庞大搜索空间的高效遍历。此外,引入频繁子树规避机制以提升搜索效率与阿尔法因子表现。基于真实股市数据的实验结果表明,本框架通过挖掘具有更高预测精度、交易表现及增强可解释性的阿尔法因子,性能优于现有方法,同时为公式化阿尔法挖掘提供了更高效的解决方案。


Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity

Abstract

arXiv:2505.11107v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have demonstrated the power of reasoning through self-generated chains of thought. Multiple reasoning agents can collaborate to raise joint reasoning quality above individual outcomes. However, such agents typically interact in a turn-based manner, trading increased latency for improved quality. In this paper, we propose Group Think, a single LLM that acts as multiple concurrent reasoning agents, or thinkers. With shared visibility into each other's partial generation progress, Group Think introduces a new concurrent-reasoning paradigm in which multiple reasoning trajectories adapt dynamically to one another at the token level. For example, a reasoning thread may shift its generation mid-sentence upon detecting that another thread is better positioned to continue. This fine-grained, token-level collaboration enables Group Think to reduce redundant reasoning and improve quality while achieving significantly lower latency. Moreover, its concurrent nature allows for efficient utilization of idle computational resources, making it especially suitable for edge inference, where very small batch sizes often underutilize local GPUs. We give a simple and generalizable modification that enables any existing LLM to perform Group Think on a local GPU. We also present an evaluation strategy to benchmark reasoning latency and empirically demonstrate latency improvements using open-source LLMs that were not explicitly trained for Group Think. We hope this work paves the way for future LLMs to exhibit more sophisticated and more efficient collaborative behavior for higher quality generation.

摘要

大语言模型(LLM)的最新进展展示了通过自生成思维链进行推理的能力。多个推理代理可以通过协作将联合推理质量提升至超越个体结果的水平。然而,此类代理通常以轮替方式交互,以增加延迟为代价换取质量提升。本文提出"群体思维"(Group Think)——由单个LLM模拟多个并发推理代理(或称思考者)。通过共享彼此部分生成进度的可见性,群体思维引入了一种新的并发推理范式,其中多个推理轨迹在词元级别动态相互适应。例如,当检测到另一线程更适合继续生成时,推理线程可能在句子中途改变其生成内容。这种细粒度的词元级协作使群体思维能够减少冗余推理并提升质量,同时显著降低延迟。此外,其并发特性允许高效利用闲置计算资源,特别适用于边缘推理场景——该场景下极小批量大小常导致本地GPU利用率不足。我们提出了一种简单且可泛化的修改方案,使现有LLM均能在本地GPU上实现群体思维。同时提出评估策略以基准测试推理延迟,并通过未经群体思维专门训练的开源LLM实证展示了延迟改进。我们希望这项工作能为未来LLM实现更复杂、更高效的协作行为以生成更优质内容开辟道路。


Feasibility with Language Models for Open-World Compositional Zero-Shot Learning

Abstract

arXiv:2505.11181v1 Announce Type: new Abstract: Humans can easily tell if an attribute (also called state) is realistic, i.e., feasible, for an object, e.g. fire can be hot, but it cannot be wet. In Open-World Compositional Zero-Shot Learning, when all possible state-object combinations are considered as unseen classes, zero-shot predictors tend to perform poorly. Our work focuses on using external auxiliary knowledge to determine the feasibility of state-object combinations. Our Feasibility with Language Model (FLM) is a simple and effective approach that leverages Large Language Models (LLMs) to better comprehend the semantic relationships between states and objects. FLM involves querying an LLM about the feasibility of a given pair and retrieving the output logit for the positive answer. To mitigate potential misguidance of the LLM given that many of the state-object compositions are rare or completely infeasible, we observe that the in-context learning ability of LLMs is essential. We present an extensive study identifying Vicuna and ChatGPT as best performing, and we demonstrate that our FLM consistently improves OW-CZSL performance across all three benchmarks.

摘要

人类可以轻松判断某个属性(或称状态)对于物体是否真实可行,例如火可以是热的,但不能是湿的。在开放世界组合零样本学习中,当所有可能的状态-物体组合都被视为未见类别时,零样本预测器的表现往往欠佳。本研究重点利用外部辅助知识来确定状态-物体组合的可行性。我们提出的语言模型可行性评估方法(FLM)是一种简单有效的方案,通过利用大型语言模型(LLMs)来更好地理解状态与物体之间的语义关系。FLM的核心操作是向LLM查询给定组合的可行性,并获取肯定答案的输出逻辑值。考虑到许多状态-物体组合较为罕见或完全不可行可能误导LLM,我们发现LLM的上下文学习能力至关重要。我们通过广泛研究确定Vicuna和ChatGPT表现最佳,并证明FLM在全部三个基准测试中持续提升了开放世界组合零样本学习的性能。
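
The querying step in this abstract, asking a language model whether a state-object pair is feasible and reading off the logit of the positive answer, can be sketched as follows. This is a hedged illustration only: the prompt wording, the function names, the thresholding step, and the lookup table standing in for a real model are all assumptions, not the authors' implementation.

```python
from typing import Callable, Iterable, List, Tuple

def feasibility_prompt(state: str, obj: str) -> str:
    """Build a yes/no question about a state-object pair (hypothetical wording)."""
    return f"Can a {obj} be {state}? Answer yes or no:"

def feasibility_score(state: str, obj: str,
                      yes_logit_fn: Callable[[str], float]) -> float:
    """Return the model's logit for the positive answer to the prompt."""
    return yes_logit_fn(feasibility_prompt(state, obj))

def filter_feasible(pairs: Iterable[Tuple[str, str]],
                    yes_logit_fn: Callable[[str], float],
                    threshold: float) -> List[Tuple[str, str]]:
    """Keep pairs whose positive-answer logit exceeds a threshold.
    Thresholding is an assumption here; the abstract only says the logit is retrieved."""
    return [(s, o) for s, o in pairs
            if feasibility_score(s, o, yes_logit_fn) > threshold]

# A lookup table stands in for a real language model's output logits.
toy_logits = {
    feasibility_prompt("hot", "fire"): 3.2,
    feasibility_prompt("wet", "fire"): -2.5,
}
kept = filter_feasible([("hot", "fire"), ("wet", "fire")],
                       toy_logits.__getitem__, threshold=0.0)
```

In practice `yes_logit_fn` would wrap a model call, and the abstract suggests that in-context examples in the prompt matter for handling rare or infeasible compositions.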


Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP

Abstract

arXiv:2505.11189v1 Announce Type: new Abstract: Generative AI systems can help spread information but also misinformation and biases, potentially undermining the UN Sustainable Development Goals (SDGs). Explainable AI (XAI) aims to reveal the inner workings of AI systems and expose misbehaviours or biases. However, current XAI tools, built for simpler models, struggle to handle the non-numerical nature of large language models (LLMs). This paper examines the effectiveness of global XAI methods, such as rule-extraction algorithms and SHAP, in detecting bias in LLMs. To do so, we first show a text-to-ordinal mapping strategy to convert non-numerical inputs/outputs into numerical features, enabling these tools to identify (some) misinformation-related biases in LLM-generated content. Then, we inject non-linear biases of varying complexity (univariate, conjunctive, and non-convex) into widespread LLMs like ChatGPT and Llama via system instructions, using global XAI methods to detect them. This way, we found that RuleFit struggles with conjunctive and non-convex biases, while SHAP can approximate conjunctive biases but cannot express them as actionable rules. Hence, we introduce RuleSHAP, a global rule extraction algorithm combining SHAP and RuleFit to detect more non-univariate biases, improving injected bias detection over RuleFit by +94% (MRR@1) on average.

摘要

生成式人工智能系统在传播信息的同时也可能助长错误信息和偏见,从而可能破坏联合国可持续发展目标(SDGs)。可解释人工智能(XAI)旨在揭示AI系统的内部运作机制并暴露其不当行为或偏见。然而,当前为简单模型设计的XAI工具难以处理大型语言模型(LLMs)的非数值特性。本文研究了全局XAI方法(如规则提取算法和SHAP)在检测LLMs偏见方面的有效性。为此,我们首先提出一种文本到序数的映射策略,将非数值输入/输出转换为数值特征,使这些工具能够识别LLM生成内容中(部分)与错误信息相关的偏见。接着,我们通过系统指令向ChatGPT和Llama等主流LLMs注入不同复杂度(单变量、合取和非凸)的非线性偏见,并利用全局XAI方法进行检测。研究发现,RuleFit难以处理合取和非凸偏见,而SHAP虽能近似识别合取偏见却无法将其转化为可操作规则。为此,我们提出RuleSHAP算法——一种结合SHAP与RuleFit的全局规则提取方法,可检测更多非单变量偏见,其注入偏见的检测性能较RuleFit平均提升94%(MRR@1指标)。
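
For readers unfamiliar with the MRR@k metric cited above, here is a small sketch using the standard definition of mean reciprocal rank truncated at rank k. The rule strings and the function name `mrr_at_k` are hypothetical, and the abstract does not say how the paper computes the metric in detail.

```python
def mrr_at_k(ranked_rules, injected_rules, k):
    """Mean reciprocal rank of the injected rule, counting only the top-k positions."""
    total = 0.0
    for ranking, target in zip(ranked_rules, injected_rules):
        top_k = ranking[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)
    return total / len(injected_rules)

# Hypothetical rankings of extracted rules for three experiments,
# each with the same injected bias rule.
rankings = [
    ["age>60", "region=north"],
    ["region=north", "age>60"],
    ["income<20k", "age>60"],
]
targets = ["age>60", "age>60", "age>60"]

score_at_1 = mrr_at_k(rankings, targets, k=1)
score_at_2 = mrr_at_k(rankings, targets, k=2)
```

With k=1 the metric reduces to the fraction of experiments in which the injected rule is ranked first.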


Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment

Abstract

arXiv:2505.11194v1 Announce Type: new Abstract: Predicting protein function from sequence is a central challenge in computational biology. While existing methods rely heavily on structured ontologies or similarity-based techniques, they often lack the flexibility to express structure-free functional descriptions and novel biological functions. In this work, we introduce Prot2Text-V2, a novel multimodal sequence-to-text model that generates free-form natural language descriptions of protein function directly from amino acid sequences. Our method combines a protein language model as a sequence encoder (ESM-3B) and a decoder-only language model (LLaMA-3.1-8B-Instruct) through a lightweight nonlinear modality projector. A key innovation is our Hybrid Sequence-level Contrastive Alignment Learning (H-SCALE), which improves cross-modal learning by matching mean- and std-pooled protein embeddings with text representations via contrastive loss. After the alignment phase, we apply instruction-based fine-tuning using LoRA on the decoder to teach the model how to generate accurate protein function descriptions conditioned on the protein sequence. We train Prot2Text-V2 on about 250K curated entries from SwissProt and evaluate it under low-homology conditions, where test sequences have low similarity with training samples. Prot2Text-V2 consistently outperforms traditional and LLM-based baselines across various metrics.

摘要

预测蛋白质功能是计算生物学领域的核心挑战。现有方法主要依赖结构化本体论或基于相似性的技术,往往难以灵活表达非结构化的功能描述和新型生物学功能。本研究提出Prot2Text-V2——一种新型多模态序列到文本模型,可直接从氨基酸序列生成自由形式的蛋白质功能自然语言描述。该方法通过轻量级非线性模态投影器,将蛋白质语言模型(ESM-3B)作为序列编码器与仅解码语言模型(LLaMA-3.1-8B-Instruct)相结合。关键创新在于混合序列级对比对齐学习(H-SCALE),通过对比损失将均值池化和标准差池化的蛋白质嵌入与文本表示进行匹配,从而提升跨模态学习效果。在完成对齐阶段后,我们采用基于指令的LoRA微调方法训练解码器,使模型能够根据蛋白质序列生成准确的功能描述。Prot2Text-V2在SwissProt约25万条精选条目上完成训练,并在低同源条件下(测试序列与训练样本相似度低)进行评估。实验结果表明,该模型在各项指标上均优于传统方法和基于大语言模型的基线系统。
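
The pooling and contrastive matching mentioned above can be illustrated with a small pure-Python sketch. It assumes that protein and text vectors already live in a shared space of equal dimension, uses a one-directional InfoNCE loss with a hypothetical temperature of 0.07, and is not the authors' H-SCALE implementation.

```python
import math

def mean_std_pool(token_embeddings):
    """Concatenate the per-dimension mean and (population) std over tokens."""
    n = len(token_embeddings)
    d = len(token_embeddings[0])
    means = [sum(tok[j] for tok in token_embeddings) / n for j in range(d)]
    stds = [
        math.sqrt(sum((tok[j] - means[j]) ** 2 for tok in token_embeddings) / n)
        for j in range(d)
    ]
    return means + stds

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(protein_vecs, text_vecs, temperature=0.07):
    """InfoNCE loss where the i-th text is the positive for the i-th protein."""
    total = 0.0
    for i, p in enumerate(protein_vecs):
        logits = [cosine(p, t) / temperature for t in text_vecs]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(protein_vecs)

pooled = mean_std_pool([[1.0, 2.0], [3.0, 2.0]])
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each protein vector is closest to its own description and large when the pairs are swapped, which is the signal an alignment phase of this kind optimises.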


SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

Abstract

arXiv:2505.11274v1 Announce Type: new Abstract: Recently, large reasoning models have demonstrated exceptional performance on various tasks. However, reasoning models inefficiently over-process both trivial and complex queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter, a self-adaptive controllable reasoning strategy for efficient reasoning. Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query. Then, we introduce budget-guided GRPO for reinforcement learning, which effectively maintains accuracy while reducing output length. SelfBudgeter allows users to anticipate generation time and make informed decisions about continuing or interrupting the process. Furthermore, our method enables direct manipulation of reasoning length by pre-filling the token budget. Experimental results demonstrate that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.

摘要

近期,大型推理模型在各类任务中展现出卓越性能。然而,这些模型在处理简单与复杂查询时均存在低效过度处理现象,导致资源浪费和用户延迟增加。为应对这一挑战,我们提出SelfBudgeter,一种自适应可控的高效推理策略。该方法采用双阶段训练范式:首先,模型学习基于查询难度预先估算推理成本;其次,我们引入预算引导的GRPO强化学习方法,在保持精度的同时有效缩减输出长度。SelfBudgeter使用户能够预判生成时间,并据此做出继续或中断过程的决策。此外,该方法支持通过预填充令牌预算直接调控推理长度。实验结果表明,SelfBudgeter能根据问题复杂度合理分配预算,在MATH基准测试中实现最高74.47%的响应长度压缩,同时保持几乎无损的准确率。


LD-Scene: LLM-Guided Diffusion for Controllable Generation of Adversarial Safety-Critical Driving Scenarios

Abstract

arXiv:2505.11247v1 Announce Type: new Abstract: Ensuring the safety and robustness of autonomous driving systems necessitates a comprehensive evaluation in safety-critical scenarios. However, these safety-critical scenarios are rare and difficult to collect from real-world driving data, posing significant challenges to effectively assessing the performance of autonomous vehicles. Typical existing methods often suffer from limited controllability and lack user-friendliness, as extensive expert knowledge is essentially required. To address these challenges, we propose LD-Scene, a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) for user-controllable adversarial scenario generation through natural language. Our approach comprises an LDM that captures realistic driving trajectory distributions and an LLM-based guidance module that translates user queries into adversarial loss functions, facilitating the generation of scenarios aligned with user queries. The guidance module integrates an LLM-based Chain-of-Thought (CoT) code generator and an LLM-based code debugger, enhancing the controllability and robustness in generating guidance functions. Extensive experiments conducted on the nuScenes dataset demonstrate that LD-Scene achieves state-of-the-art performance in generating realistic, diverse, and effective adversarial scenarios. Furthermore, our framework provides fine-grained control over adversarial behaviors, thereby facilitating more effective testing tailored to specific driving scenarios.

摘要

确保自动驾驶系统的安全性和鲁棒性需要在安全关键场景中进行全面评估。然而,这类安全关键场景在现实驾驶数据中极为罕见且难以采集,这对有效评估自动驾驶车辆性能构成了重大挑战。现有典型方法通常存在可控性有限和用户友好性不足的问题,因其本质上需要大量专家知识。为解决这些挑战,我们提出了LD-Scene——一个将大语言模型(LLMs)与潜在扩散模型(LDMs)相结合的新型框架,通过自然语言实现用户可控的对抗场景生成。该框架包含一个捕捉真实驾驶轨迹分布的LDM,以及一个基于LLM的引导模块,该模块将用户查询转化为对抗性损失函数,从而生成符合用户需求的场景。引导模块整合了基于LLM的思维链(CoT)代码生成器和基于LLM的代码调试器,提升了生成引导函数的可控性和鲁棒性。在nuScenes数据集上进行的大量实验表明,LD-Scene在生成真实、多样且有效的对抗场景方面达到了最先进水平。此外,我们的框架提供了对对抗行为的细粒度控制,从而能够针对特定驾驶场景开展更有效的测试。


Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs

Abstract

arXiv:2505.11227v1 Announce Type: new Abstract: The development of reasoning capabilities represents a critical frontier in large language model (LLM) research, where reinforcement learning (RL) and process reward models (PRMs) have emerged as predominant methodological frameworks. Contrary to conventional wisdom, empirical evidence from DeepSeek-R1 demonstrates that pure RL training focused on mathematical problem-solving can progressively enhance reasoning abilities without PRM integration, challenging the perceived necessity of process supervision. In this study, we conduct a systematic investigation of the relationship between RL training and PRM capabilities. Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning that co-evolve synergistically during pure RL training. In particular, current PRMs underperform simple baselines like majority voting when applied to state-of-the-art models such as DeepSeek-R1 and QwQ-32B. To address this limitation, we propose Self-PRM, an introspective framework in which models autonomously evaluate and rerank their generated solutions through self-reward mechanisms. Although Self-PRM consistently improves benchmark accuracy (particularly with larger sample sizes), analysis exposes persistent challenges: the approach exhibits low precision (<10%) on difficult problems, frequently misclassifying flawed solutions as valid. These analyses underscore the need for continued RL scaling to improve reward alignment and introspective accuracy. Overall, our findings suggest that PRM may not be essential for enhancing complex reasoning, as pure RL not only improves problem-solving skills but also inherently fosters robust PRM capabilities. We hope these findings provide actionable insights for building more reliable and self-aware complex reasoning models.

摘要

推理能力的发展是大型语言模型(LLM)研究的关键前沿,其中强化学习(RL)和过程奖励模型(PRM)已成为主流方法论框架。与传统观点相反,DeepSeek-R1的实证研究表明,专注于数学问题解决的纯RL训练无需整合PRM即可逐步提升推理能力,这一发现对过程监督的必要性提出了挑战。本研究系统性地探讨了RL训练与PRM能力之间的关系,发现解题能力与过程监督能力是推理的两个互补维度,在纯RL训练过程中会协同演化。值得注意的是,当应用于DeepSeek-R1和QwQ-32B等前沿模型时,现有PRM的表现甚至不及多数投票等简单基线方法。为突破这一局限,我们提出Self-PRM框架——该自省机制使模型通过自我奖励机制自主评估并重新排序生成的解决方案。尽管Self-PRM能持续提升基准测试准确率(尤其在更大样本量时),分析仍揭示出持续存在的挑战:该方法在难题上表现出的精确度较低(<10%),经常将存在缺陷的解决方案误判为有效。这些分析表明需要持续扩展RL规模以改进奖励对齐和自省准确性。总体而言,我们的研究结果表明PRM对于增强复杂推理可能并非必需,因为纯RL不仅能提升问题解决能力,还能内生地培育强大的PRM能力。希望这些发现能为构建更可靠、更具自我意识的复杂推理模型提供可操作的见解。
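
The two selection strategies contrasted above, majority voting and self-scored reranking, can be sketched in a few lines. The sampled answers and self-assigned scores below are invented for illustration, and `self_rerank` is a hypothetical stand-in for a self-reward mechanism, not the Self-PRM procedure itself.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def self_rerank(answers, self_scores):
    """Return the answer whose solution received the highest self-assigned score."""
    best_answer, _ = max(zip(answers, self_scores), key=lambda pair: pair[1])
    return best_answer

# Four hypothetical sampled solutions to one problem and the model's own scores.
sampled_answers = ["42", "41", "42", "7"]
self_scores = [0.2, 0.9, 0.3, 0.1]

voted = majority_vote(sampled_answers)
reranked = self_rerank(sampled_answers, self_scores)
```

In this toy case the self-score favours the minority answer "41", which mirrors the failure mode the abstract reports: a flawed solution rated as valid can override an otherwise correct majority.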


TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes

Abstract

arXiv:2505.11270v1 Announce Type: new Abstract: The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data analytics system. Specifically, we propose a novel architecture built upon the Model Context Protocol (MCP), an emerging paradigm that enables LLMs to collaborate with knowledgeable agents. First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes and develop an AI-agent-powered NL2Operator translator to bridge user intent and analytical execution. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities. This design enhances both accuracy and efficiency, while supporting high scalability through modular deployment. Finally, we propose an updating mechanism that harnesses deep research and machine unlearning techniques to refresh the data lakes and LLM knowledge, with the goal of balancing data freshness and inference efficiency.

摘要

数据湖中数据的多样性给数据分析带来了重大挑战,数据科学家需要同时分析包括结构化、半结构化和非结构化数据在内的多模态数据。尽管大语言模型(LLMs)已展现出良好的能力,但在准确性、效率和时效性方面仍不足以满足多模态数据分析的需求。首先,当前的自然语言(NL)或类SQL查询语言可能难以精确且全面地捕捉用户的分析意图;其次,依赖单一统一的大语言模型处理多样化的数据模态通常会导致显著的推理开销;第三,数据湖中存储的数据可能存在不完整或过时问题,因此必须整合外部开放域知识以生成及时相关的分析结果。

本文提出了一种新型多模态数据分析系统。具体而言,我们设计了一种基于模型上下文协议(MCP)的创新架构,该新兴范式可使大语言模型与知识代理协同工作。首先,我们定义了专为查询数据湖多模态数据设计的语义操作符层次结构,并开发了由AI代理驱动的自然语言到操作符转换器(NL2Operator),以桥接用户意图与分析执行。其次,我们提出了基于MCP的执行框架,其中每个MCP服务器托管针对特定数据模态优化的专用基础模型,该设计在提升准确性和效率的同时,通过模块化部署支持高度可扩展性。最后,我们通过深度研究和机器遗忘技术构建更新机制,以刷新数据湖和大语言模型知识,旨在平衡数据时效性与推理效率。


TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

Abstract

arXiv:2505.11329v1 Announce Type: new Abstract: Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Further, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The computation of one subset is then overlapped with the communication of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce-RMSNorm kernel that carefully leverages the Multimem instruction support available on NVIDIA Hopper GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 29% latency gains and up to 26% throughput gains across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.

摘要

大型语言模型(LLM)的分布式推理即使在通过NVLINK等高速互连连接的GPU上也会产生高达20%的开销。目前已提出多种技术通过将计算分解为更细粒度的任务,并在子任务完成时重叠通信来缓解这些开销。然而,在GPU上将大规模计算细粒度分解为多个小计算会产生额外开销。此外,通信本身会占用大量流式多处理器(SM),进一步增加了开销。

我们提出TokenWeave来解决这些挑战。TokenWeave采用了一种令牌分割技术,以波感知方式将推理批次的令牌划分为两个近似相等的子集,使一个子集的计算与另一个子集的通信重叠执行。此外,TokenWeave优化了层归一化计算相对于通信操作的执行顺序,并实现了一种新颖的融合式AllReduce-RMSNorm内核,充分利用NVIDIA Hopper GPU的多存储器指令支持。这些优化使TokenWeave仅需2-8个SM即可完成通信和RMSNorm操作。我们的内核还实现了内存受限的RMSNorm与其他批次计算的重叠,从而获得额外收益。评估结果表明,在多种模型和工作负载下,TokenWeave可实现高达29%的延迟降低和26%的吞吐量提升。在多个场景中,TokenWeave的性能甚至优于移除了所有通信的等效模型。
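
The token-splitting idea above can be sketched as a simple index computation. The abstract does not define "wave-aware" precisely; this sketch assumes it means rounding the split point to a multiple of a GPU tile size so that neither subset ends with a mostly empty wave. The tile size of 128 and the function name `split_tokens` are hypothetical.

```python
def split_tokens(num_tokens: int, tile: int = 128):
    """Split a batch of tokens into two roughly equal parts, with the first
    part rounded to the nearest multiple of `tile` (assumed wave granularity)."""
    if num_tokens <= tile:
        # Too small to split usefully: keep everything in one subset.
        return num_tokens, 0
    half = num_tokens // 2
    # Round half to the nearest multiple of tile using integer arithmetic.
    first = max(tile, (half + tile // 2) // tile * tile)
    first = min(first, num_tokens)
    return first, num_tokens - first

first_part, second_part = split_tokens(1000)
```

While the first subset's results are being communicated across GPUs, the second subset can be computed, and vice versa; the sketch only covers the partitioning step, not the overlap itself.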


LLM-Explorer: Towards Efficient and Affordable LLM-based Exploration for Mobile Apps

Abstract

arXiv:2505.10593v1 Announce Type: cross Abstract: Large language models (LLMs) have opened new opportunities for automated mobile app exploration, an important and challenging problem that used to suffer from the difficulty of generating meaningful UI interactions. However, existing LLM-based exploration approaches rely heavily on LLMs to generate actions in almost every step, leading to a huge cost in token fees and computational resources. We argue that such extensive usage of LLMs is neither necessary nor effective, since many actions during exploration do not require the abilities of LLMs, or may even be biased by them. Further, based on the insight that precise and compact knowledge plays the central role in effective exploration, we introduce LLM-Explorer, a new exploration agent designed for efficiency and affordability. LLM-Explorer uses LLMs primarily for maintaining the knowledge instead of generating actions, and the knowledge is used to guide action generation in an LLM-less manner. Based on a comparison with 5 strong baselines on 20 typical apps, LLM-Explorer was able to achieve the fastest and highest coverage among all automated app explorers, with over 148x lower cost than the state-of-the-art LLM-based approach.

摘要

大语言模型(LLMs)为自动化移动应用探索开辟了新途径,这一重要且具有挑战性的问题曾因难以生成有意义的用户界面交互而受阻。然而,现有基于LLM的探索方法几乎每一步都严重依赖LLM生成操作,导致高昂的令牌费用和计算资源消耗。我们认为这种对LLM的过度使用既不必要也不高效,因为探索过程中的许多操作并不需要LLM参与,甚至可能受LLM能力影响而产生偏差。进一步地,基于"精确而紧凑的知识是有效探索核心"这一洞见,我们提出了LLM-Explorer——一种兼顾高效性与经济性的新型探索智能体。该智能体主要利用LLM维护知识而非生成操作,并通过非LLM方式利用知识指导操作生成。在与20款典型应用程序上5个强基线的对比实验中,LLM-Explorer在所有自动化应用探索器中实现了最快速度和最高覆盖率,其成本较最先进的基于LLM的方法降低了148倍以上。


Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment

Abstract

arXiv:2505.10597v1 Announce Type: cross Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and identify that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized using reward models trained on full noisy datasets perform worse than those trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other's data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40 percent label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.

摘要

奖励模型(RMs)对于将大语言模型(LLMs)与人类价值观对齐至关重要。然而,人类反馈中的噪声偏好常导致奖励错误泛化,即奖励模型过度拟合虚假模式并在策略优化过程中产生误导性信号。我们系统分析了偏好对的训练动态,发现噪声样本更难拟合且会引入不稳定性。实证研究表明,使用基于完整噪声数据集训练的奖励模型优化的LLMs,其表现逊色于基于过滤后高质量偏好训练的模型。为此,我们提出协同奖励建模(CRM),这是一个通过结合同行评审和课程学习来增强鲁棒性的在线框架。两个奖励模型并行训练并相互评估数据选择以过滤潜在噪声。课程学习将偏好数据按从易到难的结构组织,确保同步训练和稳定反馈。大量实验表明,CRM在40%标签噪声下可使RewardBench准确率提升高达9.94个百分点,显著改善了泛化能力。该框架还与隐式奖励对齐方法兼容,为稳健对齐提供了实用且通用的策略。


CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation

Abstract

arXiv:2505.10594v1 Announce Type: cross Abstract: We introduce CRPE (Code Reasoning Process Enhancer), an innovative three-stage framework for data synthesis and model training that advances the development of sophisticated code reasoning capabilities in large language models (LLMs). Building upon existing system-1 models, CRPE addresses the fundamental challenge of enhancing LLMs' analytical and logical processing in code generation tasks. Our framework presents a methodologically rigorous yet implementable approach to cultivating advanced code reasoning abilities in language models. Through the implementation of CRPE, we successfully develop an enhanced COT-Coder that demonstrates marked improvements in code generation tasks. Evaluation results on LiveCodeBench (20240701-20240901) demonstrate that our COT-Coder-7B-StepDPO, derived from Qwen2.5-Coder-7B-Base, with a pass@1 accuracy of 21.88, exceeds all models of similar or even larger size. Furthermore, our COT-Coder-32B-StepDPO, based on Qwen2.5-Coder-32B-Base, exhibits superior performance with a pass@1 accuracy of 35.08, outperforming GPT-4o on the benchmark. Overall, CRPE represents a comprehensive, open-source method that encompasses the complete pipeline from instruction data acquisition through expert code reasoning data synthesis, culminating in an autonomous reasoning enhancement mechanism.

摘要

我们提出CRPE(代码推理过程增强器),这是一种创新的三阶段数据合成与模型训练框架,旨在提升大语言模型(LLMs)的复杂代码推理能力。基于现有系统1模型,CRPE解决了增强LLMs在代码生成任务中分析与逻辑处理能力的核心挑战。该框架提供了一种方法严谨且可实施的途径,用于培养语言模型的高级代码推理能力。通过实施CRPE,我们成功开发出增强版COT-Coder,其在代码生成任务中表现出显著提升。在LiveCodeBench(20240701-20240901)的评估结果显示,基于Qwen2.5-Coder-7B-Base的COT-Coder-7B-StepDPO以21.88的pass@1准确率超越所有同规模甚至更大规模的模型。此外,基于Qwen2.5-Coder-32B-Base的COT-Coder-32B-StepDPO展现出更优异的性能,其35.08的pass@1准确率在基准测试中超越了GPT4O。总体而言,CRPE代表了一种全面的开源方法,涵盖从指令数据获取到专家级代码推理数据合成的完整流程,最终形成自主推理增强机制。


Understanding Gen Alpha Digital Language: Evaluation of LLM Safety Systems for Content Moderation

Abstract

arXiv:2505.10588v1 Announce Type: cross Abstract: This research offers a unique evaluation of how AI systems interpret the digital language of Generation Alpha (Gen Alpha, born 2010-2024). As the first cohort raised alongside AI, Gen Alpha faces new forms of online risk due to immersive digital engagement and a growing mismatch between their evolving communication and existing safety tools. Their distinct language, shaped by gaming, memes, and AI-driven trends, often conceals harmful interactions from both human moderators and automated systems. We assess four leading AI models (GPT-4, Claude, Gemini, and Llama 3) on their ability to detect masked harassment and manipulation within Gen Alpha discourse. Using a dataset of 100 recent expressions from gaming platforms, social media, and video content, the study reveals critical comprehension failures with direct implications for online safety. This work contributes: (1) a first-of-its-kind dataset capturing Gen Alpha expressions; (2) a framework to improve AI moderation systems for youth protection; (3) a multi-perspective evaluation including AI systems, human moderators, and parents, with direct input from Gen Alpha co-researchers; and (4) an analysis of how linguistic divergence increases youth vulnerability. Findings highlight the urgent need to redesign safety systems so that they are attuned to youth communication, especially given Gen Alpha's reluctance to seek help when adults fail to understand their digital world. This study combines the insight of a Gen Alpha researcher with systematic academic analysis to address critical digital safety challenges.

摘要

本研究对人工智能系统如何解读α世代(Gen Alpha,2010-2024年出生群体)的数字语言进行了创新性评估。作为与AI共同成长的首个世代,α世代因深度数字参与及不断演变的沟通方式与现有安全工具之间的脱节,正面临新型网络风险。其由游戏、网络迷因和AI驱动趋势塑造的独特语言,往往使人类审核员与自动化系统都难以察觉有害互动。我们评估了四种主流AI模型(GPT-4、Claude、Gemini和Llama 3)在识别α世代话语中隐蔽骚扰与操控行为的能力。通过分析来自游戏平台、社交媒体和视频内容的100条最新表达数据集,研究揭示了直接影响网络安全的重大理解缺陷。本研究的贡献包括:(1)首个记录α世代表达特征的数据集;(2)改进青少年保护AI审核系统的框架;(3)涵盖AI系统、人类审核员及家长的多视角评估,并包含α世代合作研究者的直接反馈;(4)关于语言差异如何加剧青少年脆弱性的分析。研究结果强调:鉴于α世代在成年人无法理解其数字世界时往往不愿寻求帮助,亟需重新设计适应青少年沟通特点的安全系统。本研究结合α世代研究者的洞见与系统化学术分析,以应对关键的数字安全挑战。


Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI

Abstract

arXiv:2505.10472v1 Announce Type: cross Abstract: Effective communication about breast and cervical cancers remains a persistent health challenge, with significant gaps in public understanding of cancer prevention, screening, and treatment, potentially leading to delayed diagnoses and inadequate treatments. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information to support patient understanding. We evaluated five general-purpose and three medical LLMs using a mixed-methods evaluation framework across linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness. Our approach utilized quantitative metrics, qualitative expert ratings, and statistical analysis using Welch's ANOVA, Games-Howell, and Hedges' g. Our results show that general-purpose LLMs produced outputs of higher linguistic quality and affectiveness, while medical LLMs demonstrate greater communication accessibility. However, medical LLMs tend to exhibit higher levels of potential harm, toxicity, and bias, reducing their performance in safety and trustworthiness. Our findings indicate a duality between domain-specific knowledge and safety in health communications. The results highlight the need for intentional model design with targeted improvements, particularly in mitigating harm and bias, and improving safety and affectiveness. This study provides a comprehensive evaluation of LLMs for cancer communication, offering critical insights for improving AI-generated health content and informing future development of accurate, safe, and accessible digital health tools.

摘要

关于乳腺癌和宫颈癌的有效传播仍是持续存在的健康挑战,公众对癌症预防、筛查和治疗的理解存在显著差距,可能导致延误诊断和治疗不足。本研究评估了大型语言模型(LLMs)在生成准确、安全且易于理解的癌症相关信息以支持患者认知方面的能力与局限。我们采用混合方法评估框架,从语言质量、安全可信度、传播可及性与情感效应三个维度,对五个通用LLMs和三个医学专用LLMs进行了评估。方法结合定量指标、定性专家评分及韦尔奇方差分析、Games-Howell检验和Hedges' g统计量。结果表明:通用LLMs在语言质量和情感效应上表现更优,而医学LLMs则展现出更强的传播可及性。然而,医学LLMs往往存在更高水平的潜在危害性、毒性及偏见,降低了其安全可信度表现。研究发现揭示了健康传播中领域专业知识与安全性之间的二元性,强调需要针对性地改进模型设计,特别是在减少危害偏见、提升安全性与情感效应方面。本研究为癌症信息传播的LLMs应用提供了全面评估,为改进AI生成健康内容及开发准确、安全、可及的数字健康工具提供了关键洞见。
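
Hedges' g, one of the effect-size statistics named above, is Cohen's d multiplied by a small-sample bias correction. The sketch below uses the common approximation 1 - 3/(4(n1 + n2) - 9) for that correction; the two score lists are invented and do not come from the study.

```python
import math
import statistics

def hedges_g(group_a, group_b):
    """Standardised mean difference with the small-sample correction factor."""
    n1, n2 = len(group_a), len(group_b)
    var1 = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    cohens_d = (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d * correction

# Hypothetical expert ratings for two groups of models.
general_purpose_scores = [2, 4, 6]
medical_scores = [1, 3, 5]
g = hedges_g(general_purpose_scores, medical_scores)
```

Here both groups have variance 4, so the pooled standard deviation is 2, Cohen's d is 0.5, and the correction factor 0.8 gives g = 0.4.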


MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices

Abstract

arXiv:2505.10607v1 Announce Type: cross Abstract: The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.

摘要

智能手机和物联网设备的日益普及,使得在资源受限的硬件上进行高效时间序列分析变得至关重要,这对人体活动识别和空气质量预测等传感应用尤为关键。尽管当前硬件感知的神经架构搜索(NAS)技术能针对特定平台自动发现架构,但尚未有研究专注于面向边缘部署的通用时间序列分析。本研究利用大语言模型(LLM)的问题解决与推理能力,提出创新框架MONAQ,将NAS重构为多目标神经架构查询任务。该框架配备多模态查询生成功能,可处理多模态时间序列输入与硬件约束,并通过基于LLM智能体的多目标搜索实现代码生成的部署就绪模型。通过整合数值数据、时间序列图像和文本描述,MONAQ显著提升了LLM对时间序列数据的理解能力。在十五个数据集上的实验表明,MONAQ发现的模型性能优于手工构建模型和NAS基线,同时具有更高的效率。


Towards an LLM-powered Social Digital Twinning Platform

Abstract

arXiv:2505.10681v1 Announce Type: cross Abstract: We present Social Digital Twinner, an innovative social simulation tool for exploring plausible effects of what-if scenarios in complex adaptive social systems. The architecture is composed of three seamlessly integrated parts: a data infrastructure featuring real-world data and a multi-dimensionally representative synthetic population of citizens, an LLM-enabled agent-based simulation engine, and a user interface that enables intuitive, natural language interactions with the simulation engine and the artificial agents (i.e. citizens). Social Digital Twinner facilitates real-time engagement and empowers stakeholders to collaboratively design, test, and refine intervention measures. It promotes a data-driven, evidence-based approach to societal problem-solving. We demonstrate the tool's interactive capabilities by addressing the critical issue of youth school dropouts in Kragero, Norway, showcasing its ability to create and execute a dedicated social digital twin using natural language.

摘要

我们提出"社会数字孪生体"——一种创新的社会模拟工具,用于探索复杂自适应社会系统中假设情景的潜在影响。该架构由三个无缝集成的部分组成:包含真实世界数据和多维代表性合成人口的数据基础设施、基于大语言模型的智能体仿真引擎,以及支持用户通过自然语言与仿真引擎和人工智能体(即公民)进行直观交互的界面。社会数字孪生体支持实时参与,使利益相关者能够协作设计、测试和完善干预措施。该方法推动了一种数据驱动、循证决策的社会问题解决途径。我们以挪威克拉格勒市青少年辍学这一关键问题为例,展示了该工具通过自然语言创建并运行专属社会数字孪生体的交互能力。


The Hitchhikers Guide to Production-ready Trustworthy Foundation Model powered Software (FMware)

Abstract

arXiv:2505.10640v1 Announce Type: cross Abstract: Foundation Models (FMs) such as Large Language Models (LLMs) are reshaping the software industry by enabling FMware, systems that integrate these FMs as core components. In this KDD 2025 tutorial, we present a comprehensive exploration of FMware that combines a curated catalogue of challenges with real-world production concerns. We first discuss the state of research and practice in building FMware. We further examine the difficulties in selecting suitable models, aligning high-quality domain-specific data, engineering robust prompts, and orchestrating autonomous agents. We then address the complex journey from impressive demos to production-ready systems by outlining issues in system testing, optimization, deployment, and integration with legacy software. Drawing on our industrial experience and recent research in the area, we provide actionable insights and a technology roadmap for overcoming these challenges. Attendees will gain practical strategies to enable the creation of trustworthy FMware in the evolving technology landscape.

摘要

以大型语言模型(LLMs)为代表的基础模型(FMs)正在通过催生FMware(以这些FMs为核心组件的系统)重塑软件产业。在本届KDD 2025教程中,我们系统性地探讨了FMware,将精选的研究挑战目录与实际生产问题相结合。首先剖析了构建FMware的研究现状与实践经验,重点探讨了模型选型、领域专用高质量数据对齐、提示词工程优化以及自主智能体编排等技术难点。随后通过梳理系统测试、性能优化、部署实施及与传统软件集成等环节的关键问题,阐述了从演示原型到生产级系统的复杂演进路径。基于我们在该领域的工业实践与最新研究成果,提供了可操作的实施建议与技术路线图,助力应对这些挑战。参会者将获得在快速演进的技术环境中构建可信FMware的实用策略。


Automating Security Audit Using Large Language Model based Agent: An Exploration Experiment

Abstract

arXiv:2505.10732v1 Announce Type: cross Abstract: In the current rapidly changing digital environment, businesses are under constant pressure to ensure that their systems are secure. Security audits help to maintain a strong security posture by ensuring that policies are in place, controls are implemented, and gaps are identified for cybersecurity risk mitigation. However, audits are usually manual, requiring much time and cost. This paper looks at the possibility of developing a framework that leverages Large Language Models (LLMs) as an autonomous agent to execute part of the security audit, namely the field audit of password policy compliance for the Windows operating system. In an exploratory experiment using GPT-4 with Langchain, the agent executed the audit tasks by accurately flagging password policy violations and appeared to be more efficient than traditional manual audits. Despite its potential limitations in operational consistency in complex and dynamic environments, the framework suggests possibilities for extension to real-time threat monitoring and compliance checks.

摘要

在当前快速变化的数字环境中,企业持续面临确保系统安全性的压力。安全审计通过确保策略落实、控制措施实施以及识别网络安全风险缓解缺口,有助于维持强大的安全态势。然而,审计通常依赖人工操作,需要耗费大量时间和成本。本文探讨了开发一种框架的可能性,利用大语言模型(LLMs)作为自主代理来执行部分安全审计工作,特别是针对Windows操作系统的密码策略合规性现场审计。通过采用GPT-4与Langchain进行探索性实验,该代理能够准确标记密码策略违规行为,执行审计任务时表现出比传统人工审计更高的效率。尽管在复杂动态环境中可能存在操作一致性的潜在限制,但该框架为扩展至实时威胁监测与合规性检查提供了可能性。


AI-enhanced semantic feature norms for 786 concepts

Abstract

arXiv:2505.10718v1 Announce Type: cross Abstract: Semantic feature norms have been foundational in the study of human conceptual knowledge, yet traditional methods face trade-offs between concept/feature coverage and verifiability of quality due to the labor-intensive nature of norming studies. Here, we introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs) while verifying the quality of norms against reliable human judgments. We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts while outperforming a comparable human-only norm dataset and word-embedding models in predicting people's semantic similarity judgments. Taken together, we demonstrate that human conceptual knowledge is richer than captured in previous norm datasets and show that, with proper validation, LLMs can serve as powerful tools for cognitive science research.

摘要

语义特征规范作为人类概念知识研究的基石,传统方法因规范研究需耗费大量人力,始终面临概念/特征覆盖度与质量可验证性之间的权衡。本研究提出一种创新方法,通过将大语言模型(LLMs)生成的特征响应与人工规范数据集相结合,并依据可靠的人类判断验证规范质量。研究发现,经AI增强的特征规范数据集NOVA(通过人工智能优化的规范)在概念间特征密度和重叠度上显著提升,同时在预测人类语义相似性判断任务中,其表现优于纯人工规范数据集及词嵌入模型。综合结果表明,人类概念知识比现有规范数据集所捕获的内容更为丰富,且经过适当验证后,大语言模型可成为认知科学研究的强有力工具。


A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

Abstract

arXiv:2505.10717v1 Announce Type: cross Abstract: High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.

摘要

GPT-4等大型语言模型的高计算成本和延迟限制了其在临床环境中的部署。小型语言模型(SLMs)提供了一种经济高效的替代方案,但其有限能力需要进行生物医学领域适应,这仍然具有挑战性。另一个瓶颈是临床数据的不可获取性和高敏感性。为解决这些挑战,我们提出了一种新颖框架,将SLMs适配为高性能临床模型。我们介绍了MediPhi系列3.8B参数SLMs,该系列采用我们的创新框架开发:基于相关医学和临床语料库(如PMC、医学指南、MedWiki等)对专家模型进行预指令微调、模型融合及临床任务对齐。为覆盖大多数临床任务,我们将CLUE基准扩展为CLUE+,规模扩大了一倍。我们的专家模型在该基准上相比基础模型(无需任何任务特定微调)实现了显著提升:医学实体识别提升64.3%,放射学报告分析提升49.5%,ICD-10编码提升44%(较GPT-4-0125高出14%)。通过模型融合将专家模型统一为MediPhi,保持了各基准的性能增益。此外,我们构建了MediFlow数据集,包含250万条高质量指令,覆盖14项医学NLP任务、98种细粒度文档类型,并支持JSON格式。通过监督微调和直接偏好优化对MediPhi进行对齐,平均进一步提升了18.9%。


Context-Aware Probabilistic Modeling with LLM for Multimodal Time Series Forecasting

Abstract

arXiv:2505.10774v1 Announce Type: cross Abstract: Time series forecasting is important for applications spanning energy markets, climate analysis, and traffic management. However, existing methods struggle to effectively integrate exogenous texts and align them with the probabilistic nature of large language models (LLMs). Current approaches either employ shallow text-time series fusion via basic prompts or rely on deterministic numerical decoding that conflicts with LLMs' token-generation paradigm, which limits contextual awareness and distribution modeling. To address these limitations, we propose CAPTime, a context-aware probabilistic multimodal time series forecasting method that leverages text-informed abstraction and autoregressive LLM decoding. Our method first encodes temporal patterns using a pretrained time series encoder, then aligns them with textual contexts via learnable interactions to produce joint multimodal representations. By combining a mixture of distribution experts with frozen LLMs, we enable context-aware probabilistic forecasting while preserving LLMs' inherent distribution modeling capabilities. Experiments on diverse time series forecasting tasks demonstrate the superior accuracy and generalization of CAPTime, particularly in multimodal scenarios. Additional analysis highlights its robustness in data-scarce scenarios through hybrid probabilistic decoding.

摘要

时间序列预测在能源市场、气候分析和交通管理等应用领域具有重要意义。然而,现有方法难以有效整合外生文本数据并将其与大型语言模型(LLMs)的概率特性相协调。当前方法要么通过基础提示实现浅层的文本-时序融合,要么依赖与LLMs词元生成范式相冲突的确定性数值解码,这限制了上下文感知和分布建模能力。为解决这些局限性,我们提出CAPTime方法——一种基于上下文感知的概率多模态时间序列预测方法,该方法利用文本信息抽象和自回归LLM解码技术。我们的方法首先通过预训练时序编码器捕捉时间模式,随后通过可学习的交互机制将其与文本上下文对齐,生成联合多模态表征。通过将混合分布专家系统与冻结参数的LLMs相结合,我们在保持LLMs固有分布建模能力的同时,实现了上下文感知的概率预测。在多样化时序预测任务上的实验表明,CAPTime尤其在多模态场景下具有卓越的准确性和泛化能力。进一步分析揭示了该方法通过混合概率解码机制在数据稀缺场景下的鲁棒性优势。


A Systematic Analysis of Base Model Choice for Reward Modeling

Abstract

arXiv:2505.10775v1 Announce Type: cross Abstract: Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that the performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we showcase the strong statistical relation between some existing benchmarks and downstream performances. We also demonstrate that the results from a small set of benchmarks could be combined to boost the model selection (+18% on average in the top 5-10). Lastly, we illustrate the impact of different post-training steps on the final performance and explore using estimated data distributions to reduce performance prediction error.

摘要

基于人类反馈的强化学习(RLHF)及其核心奖励建模已成为训练强大大型语言模型(LLM)的关键环节。在训练高质量奖励模型(RM)时,一个常被忽视的因素是基础模型的影响——随着LLM数量的快速增长,基础模型的选择正变得愈发困难。本研究系统分析了基础模型选择对奖励建模性能的影响,结果表明:相较于最常见(即默认)选择,性能最高可提升14%。此外,我们揭示了现有基准测试与下游性能之间的强统计关联,并证明通过整合少量基准测试结果可显著提升模型选择效果(前5-10名平均提升18%)。最后,我们阐释了不同训练后处理步骤对最终性能的影响,并探索利用估计数据分布来降低性能预测误差的方法。


Enhancing Low-Resource Minority Language Translation with LLMs and Retrieval-Augmented Generation for Cultural Nuances

Abstract

arXiv:2505.10829v1 Announce Type: cross Abstract: This study investigates the challenges of translating low-resource languages by integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG). Various model configurations were tested on Hakka translations, with BLEU scores ranging from 12% (dictionary-only) to 31% (RAG with Gemini 2.0). The best-performing model (Model 4) combined retrieval and advanced language modeling, improving lexical coverage, particularly for specialized or culturally nuanced terms, and enhancing grammatical coherence. A two-stage method (Model 3) using dictionary outputs refined by Gemini 2.0 achieved a BLEU score of 26%, highlighting iterative correction's value and the challenges of domain-specific expressions. Static dictionary-based approaches struggled with context-sensitive content, demonstrating the limitations of relying solely on predefined resources. These results emphasize the need for curated resources, domain knowledge, and ethical collaboration with local communities, offering a framework that improves translation accuracy and fluency while supporting cultural preservation.

摘要

本研究探讨了通过将大语言模型(LLMs)与检索增强生成(RAG)技术相结合来解决低资源语言翻译难题的方法。研究以客家话翻译为测试对象,对比了多种模型配置的效能,其BLEU评分区间为12%(仅使用词典)至31%(Gemini 2.0结合RAG)。表现最优的模型(模型4)整合了检索机制与先进语言建模技术,显著提升了词汇覆盖度(特别是专业术语与文化特定表达),并改善了语法连贯性。采用两阶段方法的模型3通过Gemini 2.0精修词典输出结果,获得26%的BLEU评分,既验证了迭代校正的价值,也揭示了领域特定表达的翻译挑战。基于静态词典的方法在处理上下文相关内容时表现欠佳,证实了仅依赖预设资源的局限性。研究结果凸显了精选资源、领域知识以及与当地社群开展伦理协作的必要性,所提出的框架在提升翻译准确度与流畅性的同时,也为文化保护提供了技术支持。


Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate

Abstract

arXiv:2505.10870v1 Announce Type: cross Abstract: This paper systematically addresses the challenges of rule retrieval, a crucial yet underexplored area. Vanilla retrieval methods, which use sparse or dense retrievers to search directly for relevant rules to support downstream reasoning, often suffer from low accuracy. This is primarily due to a significant semantic gap between the instantiated facts in the queries and the abstract representations of the rules. Such misalignment results in suboptimal retrieval quality, which in turn negatively impacts reasoning performance. To overcome these challenges, we propose Self-Induction Augmented Retrieval (SIAR), a novel approach that utilizes Large Language Models (LLMs) to induce potential inferential rules that might offer benefits for reasoning by abstracting the underlying knowledge and logical structure in queries. These induced rules are then used for query augmentation to improve retrieval effectiveness. Additionally, we introduce Rule Relevance ReEstimate (R$^3$), a method that re-estimates the relevance of retrieved rules by assessing whether the abstract knowledge they contain can be instantiated to align with the facts in the queries and whether they are helpful for reasoning. Extensive experiments across various settings demonstrate the effectiveness and versatility of our proposed methods.

摘要

本文系统性地探讨了规则检索这一关键但研究不足领域的挑战。传统检索方法直接使用稀疏或密集检索器搜索相关规则以支持下游推理,但往往存在准确率低下的问题。这主要源于查询中的实例化事实与规则的抽象表征之间存在显著语义鸿沟,这种错配导致检索质量欠佳,进而对推理性能产生负面影响。为克服这些挑战,我们提出自归纳增强检索(SIAR)这一创新方法,该方法利用大语言模型(LLM)通过抽象化查询中的底层知识与逻辑结构,归纳可能有益于推理的潜在推断规则。这些归纳规则随后被用于查询增强以提升检索效果。此外,我们提出规则相关性重估计(R$^3$)方法,通过评估检索规则所含抽象知识能否实例化为与查询事实相符、以及对推理的帮助程度,重新估计规则的相关性。多场景下的广泛实验验证了所提方法的有效性与普适性。
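
The semantic gap and the query-augmentation idea can be seen in a toy example. The sketch below is only an illustration under stated assumptions: the token-overlap scorer stands in for a sparse retriever, the rule library and query are invented, and `induced_rule` is a hand-written stand-in for what an LLM might induce. None of it is taken from the paper's implementation.

```python
def tokens(text):
    """Lower-case a string and split it into a set of word tokens."""
    for mark in ",.?":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def overlap_score(query, rule):
    """Jaccard overlap between token sets, a crude stand-in for a sparse retriever."""
    q, r = tokens(query), tokens(rule)
    return len(q & r) / len(q | r)

def retrieve(query, rules):
    """Return the rule with the highest overlap score."""
    return max(rules, key=lambda rule: overlap_score(query, rule))

# Invented rule library: one abstract rule and one distractor that shares entity names.
RULES = [
    "If X is the mother of Y and Y is the father of Z, then X is the grandmother of Z.",
    "If Alice is late, then Bob is the one to call Carol.",
]

query = "Alice is the mom of Bob. Bob is the dad of Carol. Who is Alice to Carol?"

# Hypothetical LLM output: an abstract rule induced from the concrete query.
induced_rule = "If X is the mother of Y and Y is the father of Z, X is a grandparent of Z."
augmented_query = query + " " + induced_rule

print(retrieve(query, RULES))            # the distractor wins on surface overlap
print(retrieve(augmented_query, RULES))  # the abstract kinship rule wins
```

The plain query shares mostly entity names with the distractor, while the augmented query shares the variable-based phrasing of the abstract rule, which is the effect the abstract attributes to self-induction.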


Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

Abstract

arXiv:2505.10832v1 Announce Type: cross Abstract: Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.

摘要

大型推理模型(LRMs)擅长在生成最终答案前产生显式的分步推理序列。然而,这种详细推理会带来显著的计算开销和延迟,尤其对于简单问题而言。为解决这种"过度思考"问题,我们探索如何为LRMs配备自适应思考能力:使其能基于问题复杂度动态决定是否进行显式推理。基于R1式蒸馏模型的研究发现,在提示词中插入简单省略号("...")可随机触发思考或无思考模式,揭示了推理行为中潜在的受控性。利用这一特性,我们提出AutoThink——一个通过分阶段奖励塑形逐步优化推理策略的多阶段强化学习(RL)框架。AutoThink学会仅在必要时调用显式推理,而对简单任务默认采用简洁响应。在五个主流数学基准上的实验表明,相比最近的提示法和基于RL的剪枝方法,AutoThink实现了更优的准确率-效率权衡。该框架可无缝集成至任意R1式模型(包括蒸馏模型及其微调变体)。值得注意的是,在DeepSeek-R1-Distill-Qwen-1.5B模型上,AutoThink在减少52%令牌用量的同时将相对准确率提升6.4%,为LRMs建立了可扩展的自适应推理范式。


MatTools: Benchmarking Large Language Models for Materials Science Tools

Abstract

arXiv:2505.10852v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly applied to materials science questions, including literature comprehension, property prediction, materials discovery and alloy design. At the same time, a wide range of physics-based computational approaches have been developed in which materials properties can be calculated. Here, we propose a benchmark application to evaluate the proficiency of LLMs to answer materials science questions through the generation and safe execution of codes based on such physics-based computational materials science packages. MatTools is built on two complementary components: a materials simulation tool question-answer (QA) benchmark and a real-world tool-usage benchmark. We designed an automated methodology to efficiently collect real-world materials science tool-use examples. The QA benchmark, derived from the pymatgen (Python Materials Genomics) codebase and documentation, comprises 69,225 QA pairs that assess the ability of an LLM to understand materials science tools. The real-world benchmark contains 49 tasks (138 subtasks) requiring the generation of functional Python code for materials property calculations. Our evaluation of diverse LLMs yields three key insights: (1) Generalists outshine specialists; (2) AI knows AI; and (3) Simpler is better. MatTools provides a standardized framework for assessing and improving LLM capabilities for materials science tool applications, facilitating the development of more effective AI systems for materials science and general scientific research.

摘要

大语言模型(LLMs)正日益应用于材料科学问题,包括文献理解、性能预测、材料发现与合金设计。与此同时,基于物理原理的计算方法已广泛发展,可用于材料性能计算。本文提出一种基准应用,通过生成并安全执行基于此类物理计算材料科学软件包的代码,评估LLMs回答材料科学问题的能力。MatTools由两个互补组件构成:材料模拟工具问答(QA)基准和真实世界工具使用基准。我们设计了一种自动化方法高效收集真实世界材料科学工具使用案例。QA基准源自pymatgen(Python材料基因组)代码库及文档,包含69,225个问答对,用于评估LLM理解材料科学工具的能力。真实世界基准包含49项任务(138项子任务),要求生成用于材料性能计算的功能性Python代码。我们对多种LLM的评估得出三个关键结论:(1)通才模型优于专才模型;(2)AI更了解AI;(3)越简单越好。MatTools为评估和提升LLM在材料科学工具应用中的能力提供了标准化框架,有助于开发更高效的面向材料科学及通用科学研究的AI系统。


Explain What You Mean: Intent Augmented Knowledge Graph Recommender Built With LLM

Abstract

arXiv:2505.10900v1 Announce Type: cross Abstract: Interaction sparsity is the primary obstacle for recommendation systems. Sparsity manifests in environments with disproportional cardinality of groupings of entities, such as users and products in an online marketplace. It is also found for newly introduced entities, described as the cold-start problem. Recent efforts to mitigate this sparsity issue shift the performance bottleneck to other areas in the computational pipeline. Those that focus on enriching sparse representations with connectivity data from other external sources propose methods that are resource demanding and require careful, domain-expert-aided addition of this newly introduced data. Others that turn to Large Language Model (LLM) based recommenders will quickly encounter limitations surrounding data quality and availability. In this work, we propose LLM-based Intent Knowledge Graph Recommender (IKGR), a novel framework that leverages retrieval-augmented generation and an encoding approach to construct and densify a knowledge graph. IKGR learns latent user-item affinities from an interaction knowledge graph and further densifies it through mutual intent connectivity. This addresses sparsity issues and allows the model to make intent-grounded recommendations with an interpretable embedding translation layer. Through extensive experiments on real-world datasets, we demonstrate that IKGR overcomes knowledge gaps and achieves substantial gains over state-of-the-art baselines on both publicly available and our internal recommendation datasets.

摘要

交互稀疏性是推荐系统面临的主要障碍。这种稀疏性体现在实体分组基数失衡的环境中,例如在线市场中的用户与商品。新引入实体也会出现该问题,即所谓的冷启动问题。当前缓解稀疏性的研究将性能瓶颈转移至计算流程的其他环节。那些利用外部来源的关联数据来丰富稀疏表征的方法,往往需要大量资源且依赖领域专家精心处理新增数据。而采用大语言模型(LLM)的推荐系统则很快会面临数据质量与可用性的限制。本研究提出基于LLM的意图知识图谱推荐系统(IKGR),该创新框架通过检索增强生成和编码方法构建并稠密化知识图谱。IKGR从交互知识图谱中学习用户-项目潜在关联,并通过双向意图连接进一步稠密化图谱,从而解决稀疏性问题,使模型能通过可解释的嵌入转换层做出基于意图的推荐。在真实数据集上的大量实验表明,IKGR能够克服知识鸿沟,在公开数据集和内部推荐数据集上均显著超越现有最优基线模型。


REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?

Abstract

arXiv:2505.10872v1 Announce Type: cross Abstract: Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks. Although recent large language model (LLM)-based task planners achieve amazing performance, they assume that human instructions are clear and straightforward. However, real-world users are not experts, and their instructions to robots often contain significant vagueness. Linguists suggest that such vagueness frequently arises from referring expressions (REs), whose meanings depend heavily on dialogue context and environment. This vagueness is even more prevalent among the elderly and children, whom robots should serve more. This paper studies how such vagueness in REs within human instructions affects LLM-based robot task planning and how to overcome this issue. To this end, we propose the first robot task planning benchmark with vague REs (REI-Bench), where we discover that the vagueness of REs can severely degrade robot planning performance, leading to success rate drops of up to 77.9%. We also observe that most failure cases stem from missing objects in planners. To mitigate the REs issue, we propose a simple yet effective approach: task-oriented context cognition, which generates clear instructions for robots, achieving state-of-the-art performance compared to aware prompt and chains of thought. This work contributes to the research community of human-robot interaction (HRI) by making robot task planning more practical, particularly for non-expert users, e.g., the elderly and children.

摘要

机器人任务规划将人类指令分解为可执行的动作序列,使机器人能够完成一系列复杂任务。尽管当前基于大语言模型(LLM)的任务规划器展现出卓越性能,但这些模型均假设人类指令清晰明确。然而现实场景中的用户并非专家,其发出的机器人指令往往存在显著模糊性。语言学家指出,这种模糊性通常源于指代表达式(REs)——其含义高度依赖于对话语境和环境。这种模糊现象在机器人应重点服务的老年人与儿童群体中更为普遍。本文研究了人类指令中REs模糊性对基于LLM的机器人任务规划的影响机制及其解决方案。为此,我们首次构建了包含模糊REs的机器人任务规划基准(REI-Bench),发现REs模糊性会严重降低规划性能,导致任务成功率最高下降77.9%。通过分析发现,多数失败案例源于规划器对目标物体的识别缺失。为缓解REs问题,我们提出了一种简单有效的任务导向语境认知方法,该方法能为机器人生成清晰指令,相较于情境提示和思维链等技术实现了最先进的性能表现。本研究通过提升机器人任务规划对非专家用户(特别是老年人与儿童)的适用性,为人机交互(HRI)领域的研究做出了实质性贡献。


Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer

Abstract

arXiv:2505.10945v1 Announce Type: cross Abstract: Large Language Models (LLMs) increasingly incorporate multilingual capabilities, fueling the demand to transfer them into target language-specific models. However, most approaches, which blend the source model's embedding by replacing the source vocabulary with the target language-specific vocabulary, may constrain expressive capacity in the target language since the source model is predominantly trained on English data. In this paper, we propose Semantic Aware Linear Transfer (SALT), a novel cross-lingual transfer technique that recycles embeddings from target language Pre-trained Language Models (PLMs) to transmit the deep representational strengths of PLM-derived embeddings to LLMs. SALT derives unique regression lines based on the similarity in the overlap of the source and target vocabularies, to handle each non-overlapping token's embedding space. Our extensive experiments show that SALT significantly outperforms other transfer methods and achieves lower loss with faster convergence during language adaptation. Notably, SALT obtains remarkable performance in cross-lingual understanding setups compared to other methods. Furthermore, we highlight the scalable use of PLMs to enhance the functionality of contemporary LLMs by conducting experiments with varying architectures.

摘要

大型语言模型(LLMs)日益融入多语言能力,这促使人们需求将其转化为特定目标语言的模型。然而,大多数方法通过用目标语言特定词汇替换源词汇来混合源模型的嵌入,可能会限制目标语言中的表达能力,因为源模型主要基于英语数据进行训练。在本文中,我们提出了一种新颖的跨语言迁移技术——语义感知线性迁移(SALT),该技术回收目标语言预训练语言模型(PLMs)的嵌入,将PLM衍生嵌入的深度表征优势传递给LLMs。SALT基于源词汇与目标词汇重叠的相似性,为每个非重叠词汇的嵌入空间生成独特的回归线。我们的大量实验表明,SALT显著优于其他迁移方法,并在语言适应过程中以更快的收敛速度实现更低的损失。值得注意的是,与其他方法相比,SALT在跨语言理解任务中表现出卓越的性能。此外,我们通过不同架构的实验,强调了PLMs在增强当代LLMs功能方面的可扩展应用。


Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation

Abstract

arXiv:2505.10940v1 Announce Type: cross Abstract: Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, the exploitation of the LLM's world knowledge and logic inference ability produces a virtual logic graph that reveals dynamic and expressive knowledge of users, augmenting the recommendation performance. On the other hand, the user role aligns the user behavioral logic with the observed user feedback, refining our understanding of user behaviors. We also show that the extracted user-item logic graph is empirically a general knowledge that can benefit a wide range of recommendation tasks, and conduct experiments on industrial and several public datasets as verification.

摘要

推荐系统通过从用户特征和历史行为中推断偏好,筛选出对用户有价值的内容/项目。主流方法遵循学习排序范式,侧重于发现和建模项目主题(如类别),并基于历史交互捕捉用户对这些主题的偏好。然而,这一范式往往忽略了对用户特征及其社会角色的建模,而这些正是影响相关兴趣和用户偏好转变的逻辑混杂因素。为弥补这一不足,我们提出了用户角色识别任务和行为逻辑建模任务,旨在显式建模用户角色并学习项目主题与用户社会角色间的逻辑关系。我们证明,通过大型语言模型(LLM)与推荐系统的高效集成框架可以显式解决这些任务,为此我们提出了TagCF方法。一方面,利用LLM的世界知识和逻辑推理能力生成虚拟逻辑图,揭示动态且富有表现力的用户知识,从而提升推荐性能;另一方面,用户角色将用户行为逻辑与观测到的用户反馈对齐,深化了对用户行为的理解。此外,我们还验证了提取的用户-项目逻辑图在实证上是一种普适性知识,可广泛应用于多种推荐任务,并在工业级及多个公开数据集上进行了实验验证。


GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction

Abstract

arXiv:2505.10939v1 Announce Type: cross Abstract: Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm (Ostapenko et al., 2024), we dynamically select and combine modules for new inputs without additional training. Our studies on the Phi-3 model and standard Arrow as baselines reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at https://github.com/saharsamr/Modular-LLM.

摘要

大型语言模型在零样本泛化方面常面临困难,已有多种模块化方法被提出以应对这一挑战。然而,我们提出一个关键限制仍未解决:通用知识与任务特定适应之间的耦合问题。为此,我们提出一种模块化框架,通过构建任务特定LoRA模块库与通用领域LoRA模块,实现二者的解耦。通过从每个任务特定模块中减去通用知识成分,我们获得更专注于任务相关信息的残差模块,该方法被称为通用知识减法(GenKnowSub)。利用精炼后的任务特定模块和Arrow路由算法(Ostapenko et al., 2024),我们无需额外训练即可为新输入动态选择和组合模块。基于Phi-3模型和标准Arrow基准的研究表明,使用源自英语、法语、德语等多种语言的通用知识LoRA模块,能在广泛基准测试中为单语和跨语言场景带来持续性能提升。在Phi-2模型上的进一步实验证明了GenKnowSub对较弱大型语言模型的泛化能力。完整代码与数据详见https://github.com/saharsamr/Modular-LLM。
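
The subtraction step can be pictured with a toy numeric sketch. It assumes the standard LoRA parameterisation, in which a module encodes a weight update delta_W = B·A, and it assumes the subtraction is applied to those dense updates; the abstract does not say at which level the authors subtract, and the matrices and helper names below are made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(B, A):
    """Dense weight update encoded by a LoRA pair: delta_W = B @ A."""
    return matmul(B, A)

def subtract(task_delta, general_delta):
    """Residual update: task-specific update minus the general-knowledge update."""
    return [[t - g for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(task_delta, general_delta)]

# Toy rank-1 LoRA factors on a 2x2 weight matrix (values are arbitrary).
B_task, A_task = [[1.0], [2.0]], [[3.0, 4.0]]   # delta_task = [[3, 4], [6, 8]]
B_gen, A_gen = [[1.0], [1.0]], [[1.0, 2.0]]     # delta_gen  = [[1, 2], [1, 2]]

residual = subtract(lora_delta(B_task, A_task), lora_delta(B_gen, A_gen))
print(residual)  # [[2.0, 2.0], [5.0, 6.0]]
```

Note that the difference of two low-rank updates is generally not itself rank-r, so an implementation that keeps modules in factored form would need a different representation than this dense one.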


A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

Abstract

arXiv:2505.10924v1 Announce Type: cross Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.

摘要

近期,基于人工智能的计算设备交互已从基础原型工具发展为模拟图形用户界面中人类操作的、基于大型语言模型(LLM)的复杂系统。我们正见证着"计算机使用代理"(CUA)的兴起,这些代理能够自主执行诸如操作桌面应用程序、浏览网页和使用移动应用等任务。然而,随着其能力提升,这些代理也带来了新的安全与安防风险。LLM驱动推理的脆弱性,加之集成多软件组件与多模态输入的复杂性,进一步加剧了安全形势的复杂性。本文系统梳理了CUA安全与安防威胁的知识体系,通过全面文献综述围绕四个研究目标提炼发现:(i)界定适用于安全分析的CUA定义;(ii)对当前CUA面临的安防威胁进行分类;(iii)提出现有防御策略的完整分类法;(iv)总结用于评估CUA安全性与性能的主流基准测试、数据集及评价指标。基于这些洞见,本研究为未来研究者探索未知漏洞提供了结构化基础,并为从业者设计部署安全的计算机使用代理提供了可操作的指导。


Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations

Abstract

arXiv:2505.10937v1 Announce Type: cross Abstract: The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores, which describe the appropriateness of CoT verbosity and cognitive difficulty level for models to comprehend these reasoning processes. We further establish a self-reliant pipeline to curate this dataset. Extensive experiments using Qwen2.5 models of various sizes demonstrate the positive impact of our proposed scores on LRM training effectiveness. Based on the proposed OmniThought dataset, we further train and release a series of high-performing LRMs, specifically equipped with stronger reasoning abilities and optimal CoT output length and difficulty level. Our contributions significantly enhance the development and training of LRMs for solving complex tasks.

摘要

大型推理模型(LRMs)的出现通过其在数学问题求解和代码生成等复杂任务中的卓越表现,彻底改变了自然语言处理领域。这些模型利用思维链(CoT)过程,能够模拟类人的推理策略。然而,由于缺乏全面的CoT数据集,LRMs的发展受到阻碍。现有资源通常无法提供由多个教师模型提炼的、具有连贯CoT过程的大规模推理问题,也未能涵盖描述CoT内部特征的多维属性。为应对这些挑战,我们推出了OmniThought数据集——该大规模数据集包含由两个高性能LRM作为教师模型生成并验证的200万条CoT过程。OmniThought中的每条CoT过程均标注了创新的"推理详尽度"(RV)和"认知难度"(CD)评分,这些指标分别描述了CoT详尽程度的适当性以及模型理解这些推理过程所需的认知难度水平。我们进一步建立了一个自主的数据集构建流程。基于不同规模的Qwen2.5模型进行的广泛实验表明,我们提出的评分体系对LRM训练效果具有积极影响。基于OmniThought数据集,我们进一步训练并发布了一系列高性能LRM,这些模型特别配备了更强的推理能力以及最优的CoT输出长度与难度级别。本研究的贡献显著提升了用于解决复杂任务的LRM的开发与训练水平。


Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents

Abstract

arXiv:2505.10961v1 Announce Type: cross Abstract: Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to enhance automated vulnerability detection. It employs four role-specific agents: security researcher, code author, moderator, and review board. Through extensive experiments using GPT-3.5 and GPT-4o, we demonstrate that VulTrial outperforms single-agent and multi-agent baselines. Using GPT-4o, VulTrial improves the performance by 102.39% and 84.17% over its respective baseline. Additionally, we show that role-specific instruction tuning in multi-agent with small data (50 pair samples) improves the performance of VulTrial further by 139.89% and 118.30%. Furthermore, we analyze the impact of increasing the number of agent interactions on VulTrial's overall performance. While multi-agent setups inherently incur higher costs due to increased token usage, our findings reveal that applying VulTrial to a cost-effective model like GPT-3.5 can improve its performance by 69.89% compared to GPT-4o in a single-agent setting, at a lower overall cost.

摘要

源代码漏洞检测仍是一项关键而具有挑战性的任务,尤其是当良性代码与漏洞函数存在高度相似性时。本研究提出VulTrial——一个受法庭审判启发的多智能体框架,旨在提升自动化漏洞检测能力。该框架部署了四个角色化智能体:安全研究员、代码作者、调解员和评审委员会。基于GPT-3.5和GPT-4o的广泛实验表明,VulTrial在性能上显著超越单智能体和多智能体基线。使用GPT-4o时,VulTrial相较各自基线分别提升102.39%和84.17%的性能。此外,研究发现小样本数据(50对样本)下的角色化指令微调可使VulTrial性能进一步提升139.89%和118.30%。我们还分析了增加智能体交互次数对整体性能的影响。虽然多智能体设置因令牌消耗增加导致更高成本,但实验表明:将VulTrial应用于GPT-3.5等经济型模型时,能以更低的总成本取得比单智能体设置下的GPT-4o高69.89%的性能。


Humans expect rationality and cooperation from LLM opponents in strategic games

Abstract

arXiv:2505.11011v1 Announce Type: cross Abstract: As Large Language Models (LLMs) integrate into our social and economic interactions, we need to deepen our understanding of how humans respond to LLM opponents in strategic settings. We present the results of the first controlled monetarily-incentivised laboratory experiment looking at differences in human behaviour in a multi-player p-beauty contest against other humans and LLMs. We use a within-subject design in order to compare behaviour at the individual level. We show that, in this environment, human subjects choose significantly lower numbers when playing against LLMs than when playing against humans, which is mainly driven by the increased prevalence of 'zero' Nash-equilibrium choices. This shift is mainly driven by subjects with high strategic reasoning ability. Subjects who play the zero Nash-equilibrium choice motivate their strategy by appealing to the LLMs' perceived reasoning ability and, unexpectedly, propensity towards cooperation. Our findings provide foundational insights into the multi-player human-LLM interaction in simultaneous choice games, uncover heterogeneities in both subjects' behaviour and beliefs about LLMs' play when playing against them, and suggest important implications for mechanism design in mixed human-LLM systems.

摘要

随着大型语言模型(LLMs)日益融入社会经济交互,我们亟需深入理解人类在策略环境中如何应对LLM对手。本文通过首个受控货币激励实验室实验,研究了人类在多参与者p-beauty竞赛中对阵人类与LLMs时的行为差异。采用组内设计以进行个体层面的行为比较,研究发现:在该实验环境中,受试者与LLMs对弈时选择的数字显著低于人类对手,这主要源于"零"纳什均衡选择频率的显著提升。该行为转变主要由具有高策略推理能力的受试者驱动。选择零纳什均衡策略的受试者将其决策归因于对LLM推理能力的认知,以及(出人意料地)对其合作倾向的预判。本研究为同步选择博弈中多参与者人机交互提供了基础性洞见,揭示了受试者行为及其对LLM策略认知的异质性特征,并对人机混合系统的机制设计具有重要启示。
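
For readers unfamiliar with the game, a p-beauty contest has every player submit a number, and the winner is whoever is closest to p times the average of all submissions. The sketch below shows why deeper strategic reasoning pushes choices toward the zero Nash equilibrium when p < 1. The value p = 2/3, the level-0 guess of 50, and the function names are assumptions made for illustration; the abstract does not state the parameters used in the experiment, and the level-k rule here ignores a player's own effect on the average.

```python
def target(choices, p):
    """Winning target: p times the mean of all submitted numbers."""
    return p * sum(choices) / len(choices)

def winner_index(choices, p):
    """Index of the player whose number is closest to the target."""
    t = target(choices, p)
    return min(range(len(choices)), key=lambda i: abs(choices[i] - t))

def level_k_guess(k, p, level0=50.0):
    """Guess of a player who assumes everyone else reasons k-1 steps deep."""
    return level0 * p ** k

P = 2 / 3  # assumed value, a common choice in the literature

# Each extra step of reasoning multiplies the guess by p, so guesses shrink toward 0.
guesses = [level_k_guess(k, P) for k in range(6)]
print([round(g, 2) for g in guesses])

# If everyone plays 0, the target is 0 and no one can get closer by deviating.
print(target([0, 0, 0, 0], P))

# In a mixed group, the lowest of these three guesses is closest to the target.
print(winner_index([50, 30, 20], P))
```

A subject who believes the opponents are highly rational (many reasoning steps) should therefore pick a number near zero, which matches the direction of the shift reported when subjects face LLMs.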


Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

Abstract

arXiv:2505.11004v1 Announce Type: cross Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amounts of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.

摘要

仅通过基于网络规模数据进行下一词预测训练的大规模Transformer语言模型(LMs),在仅观察少量示例后即可解决广泛任务。这种被称为上下文学习(ICL)的能力机制仍存在争议且理解不足。部分研究认为其仅是海量数据记忆的结果,而另一些研究则主张这反映了LMs中根本性的符号算法发展。本研究引入了一套调查任务和新方法,通过利用完整的Pythia扩展套件(包括捕获渐进增加训练数据的中间检查点)系统研究ICL。通过细致探究下游任务的ICL表现,同时开展对残差流子空间的机制分析,我们证明ICL超越了训练语料的单纯"记忆",但并未达到独立符号算法的实现程度。我们的结果还阐明了ICL的多个方面,包括训练动态的影响、模型能力及机制可解释性要素。总体而言,本研究推进了对ICL及其影响的理解,为模型开发者提供了潜在改进的洞见,并为AI安全实践者奠定了更明智指南的基础。


Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

Abstract

arXiv:2505.11010v1 Announce Type: cross Abstract: The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative "Ask-Respond-Review" process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.

摘要

大型语言模型(LLMs)在对话式人工智能中的应用效果受限于其对单轮监督微调(SFT)数据的依赖,这限制了多轮对话中的上下文连贯性。现有的多轮对话数据生成方法难以同时保证指令的多样性与质量。为此,我们提出Review-Instruct框架,该框架通过包含候选者、多名评审员和主席三类代理角色的迭代式"提问-响应-评审"流程合成多轮对话。该框架通过整合评审反馈迭代优化指令,从而提升对话多样性与难度。基于Alpaca数据集构建多轮对话数据集并对LLaMA2-13B模型进行微调。在MT-Bench、MMLU-Pro和Auto-Arena上的评估表明,相较于基于LLaMA2-13B的现有最优模型,本模型分别取得MMLU-Pro指标2.9%和MT-Bench指标2%的绝对提升。消融实验证实评审阶段与多评审员机制对提升指令多样性与难度的关键作用。本研究揭示了基于评审机制的多代理框架在大规模生成高质量对话数据方面的潜力。


Group-in-Group Policy Optimization for LLM Agent Training

Abstract

arXiv:2505.10978v1 Announce Type: cross Abstract: Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to long-horizon LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals and achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over the GRPO baseline: all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.

摘要

基于分组的强化学习(RL)最新进展推动了前沿大语言模型(LLM)在数学推理等单轮任务中的表现,但其在长周期LLM智能体训练中的可扩展性仍受限。与静态任务不同,智能体-环境交互会展开为多步过程,且常产生稀疏或延迟奖励,这使得跨单步的信用分配更具挑战性。本研究提出分组内策略优化(GiGPO),这是一种新型RL算法,可在保留分组RL优势(免评论家、低内存、稳定收敛)的同时,实现LLM智能体的细粒度信用分配。GiGPO采用双层结构评估相对优势:(i)在情节层面,基于完整轨迹组计算宏观相对优势;(ii)在步骤层面,通过锚定状态分组机制追溯构建步骤级分组——通过识别跨轨迹的重复环境状态,将源自同一状态的动作归为一组,从而实现微观相对优势评估。这种分层结构无需依赖辅助模型或额外 rollout 即可有效捕捉全局轨迹质量与局部步骤有效性。我们在ALFWorld和WebShop两个挑战性智能体基准上使用Qwen2.5-1.5B-Instruct与Qwen2.5-7B-Instruct评估GiGPO。关键的是,GiGPO能提供细粒度的单步信用信号,在保持相同GPU内存开销、相同LLM rollout且几乎不增加时间成本的前提下,相较GRPO基线在ALFWorld上实现>12%、在WebShop上实现>9%的性能提升。
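
The two-level advantage structure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the mean/std normalization, the per-step return field, and the mixing weight `omega` are all hypothetical choices, since the abstract does not specify them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative(values, eps=1e-8):
    """Normalize values within one group: (x - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def two_level_advantages(trajectories, omega=1.0):
    """Combine episode-level and step-level relative advantages.

    trajectories: list of dicts with
      'return': total return of the trajectory (float)
      'steps' : list of (state, step_return) pairs; states must be hashable.
    Returns one list of per-step advantages per trajectory.
    """
    # Episode level: compare complete trajectories within the group.
    episode_adv = group_relative([t["return"] for t in trajectories])

    # Step level: group steps that start from the same (anchor) state.
    groups = defaultdict(list)
    for ti, traj in enumerate(trajectories):
        for si, (state, step_return) in enumerate(traj["steps"]):
            groups[state].append((ti, si, step_return))

    step_adv = {}
    for members in groups.values():
        normed = group_relative([r for _, _, r in members])
        for (ti, si, _), a in zip(members, normed):
            step_adv[(ti, si)] = a  # singleton groups get 0.0

    # omega is an assumed weight balancing the two signals.
    return [
        [episode_adv[ti] + omega * step_adv[(ti, si)]
         for si in range(len(traj["steps"]))]
        for ti, traj in enumerate(trajectories)
    ]
```

Because the step-level groups are built from states already visited in the sampled trajectories, the sketch needs no extra rollouts or critic, which matches the property the abstract emphasizes.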


BLEUBERI: BLEU is a surprisingly effective reward for instruction following

Abstract

arXiv:2505.11080v1 Announce Type: cross Abstract: Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.

摘要

奖励模型是将大语言模型(LLM)与人类偏好对齐的核心工具,但其训练成本高昂,需要大规模人工标注的偏好数据和强大的预训练LLM骨干网络。与此同时,高质量合成指令跟随数据集的日益普及引发了一个问题:在基于强化学习的对齐过程中,能否用更简单的基于参考指标的度量方法替代奖励模型?本文首先证明,基础的字符串匹配指标BLEU在通用指令跟随数据集上,与人类偏好的吻合度出人意料地媲美强奖励模型。基于这一发现,我们提出了BLEUBERI方法:该方法先识别具有挑战性的指令,随后直接以BLEU作为奖励函数应用组相对策略优化(GRPO)。实验表明,在四个高难度指令跟随基准测试和三种不同基础语言模型中,BLEUBERI训练得到的模型性能与奖励模型指导的强化学习模型相当。人工评估进一步证实,BLEUBERI模型的输出质量与奖励模型对齐模型持平。此外,BLEUBERI模型生成的输出在事实准确性上优于其他竞争方法。总体而言,我们证明当获得高质量参考输出(可通过现有指令跟随数据集或合成数据生成轻松获取)时,基于字符串匹配的指标是对齐过程中廉价而有效的奖励模型替代方案。代码和数据已发布于https://github.com/lilakk/BLEUBERI。
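
As a rough illustration of using a string-matching score as an RL reward, the sketch below computes a simplified sentence-level BLEU against a reference and turns a group of sampled responses into group-relative advantages. The whitespace tokenization, add-one smoothing, and function names are assumptions made for this example; they are not taken from the BLEUBERI code, and production BLEU implementations (for example sacreBLEU) differ in tokenization and smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: smoothed n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[g]) for g, count in c.items())
        total = sum(c.values())
        # Add-one smoothing keeps the logarithm finite when overlap is zero.
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_prec)


def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization of rewards within one sampled group."""
    mu = sum(rewards) / len(rewards)
    sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    samples = ["the cat sat on the mat", "the cat is on the mat", "dogs bark loudly"]
    rewards = [simple_bleu(s, reference) for s in samples]
    print(rewards, group_advantages(rewards))
```

The design point is that the reward needs only a reference output per instruction, so no learned reward model has to be trained or served during alignment.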


Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans

Abstract

arXiv:2505.11141v1 Announce Type: cross Abstract: The goal of achieving Artificial General Intelligence (AGI) is to imitate humans and surpass them. Models such as OpenAI's o1, o3, and DeepSeek's R1 have demonstrated that large language models (LLMs) with human-like reasoning capabilities exhibit exceptional performance and are being gradually integrated into multimodal large language models (MLLMs). However, whether these models possess capabilities comparable to humans in handling reasoning tasks remains unclear at present. In this paper, we propose Human-Aligned Bench, a benchmark for fine-grained alignment of multimodal reasoning with human performance. Specifically, we collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions, encompassing four question types: visual reasoning, definition judgment, analogical reasoning, and logical judgment. More importantly, each question is accompanied by human success rates and options that humans are prone to choosing incorrectly. Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance. The findings on our benchmark provide insights into the development of the next-generation models.

摘要

实现人工通用智能(AGI)的目标在于模仿人类并超越之。OpenAI的o1、o3与DeepSeek的R1等模型表明,具备类人推理能力的大语言模型(LLMs)展现出卓越性能,并正逐步融入多模态大语言模型(MLLMs)。然而,这些模型在处理推理任务时是否具备与人类相当的能力目前尚不明确。本文提出"人类对齐基准"(Human-Aligned Bench),用于多模态推理与人类表现的细粒度对齐评估。具体而言,我们收集了9,794个仅依赖上下文推理的多模态问题,包含双语(中英文)多模态问题及纯文本问题,涵盖视觉推理、定义判断、类比推理和逻辑判断四种题型。更重要的是,每个问题均附带人类正确率及易错选项。在Human-Aligned Bench上的大量实验表明,当前MLLMs在多模态推理中的表现与人类水平存在显著差异。本基准的研究结果为下一代模型的开发提供了重要启示。


Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Abstract

arXiv:2505.11177v1 Announce Type: cross Abstract: This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then a pipeline involving large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) for better document comprehension. Made available in an accessible Gradio interface, the current research shows a real-world application of libraries, models, and APIs to close the language gap and enhance access to information in image media across different linguistic environments.

摘要

本文提出了一种端到端的多语言图像文档信息提取与处理系统。该系统采用光学字符识别技术(Tesseract)提取英语、印地语和泰米尔语等语言的文本,随后通过包含大语言模型API(Gemini)的处理流程实现跨语言翻译、摘要生成及目标语言回译。系统还集成情感分析(TensorFlow)、主题分类(Transformers)和日期提取(正则表达式)等模块以增强文档理解能力。研究通过可交互的Gradio界面展示了该应用,有效整合了各类库、模型和API,旨在消除语言障碍并提升多语言环境下图像媒体信息的可访问性。


PARSEC: Preference Adaptation for Robotic Object Rearrangement from Scene Context

Abstract

arXiv:2505.11108v1 Announce Type: cross Abstract: Object rearrangement is a key task for household robots requiring personalization without explicit instructions, meaningful object placement in environments occupied with objects, and generalization to unseen objects and new environments. To facilitate research addressing these challenges, we introduce PARSEC, an object rearrangement benchmark for learning user organizational preferences from observed scene context to place objects in a partially arranged environment. PARSEC is built upon a novel dataset of 110K rearrangement examples crowdsourced from 72 users, featuring 93 object categories and 15 environments. We also propose ContextSortLM, an LLM-based rearrangement model that places objects in partially arranged environments by adapting to user preferences from prior and current scene context while accounting for multiple valid placements. We evaluate ContextSortLM and existing personalized rearrangement approaches on the PARSEC benchmark and complement these findings with a crowdsourced evaluation of 108 online raters ranking model predictions based on alignment with user preferences. Our results indicate that personalized rearrangement models leveraging multiple scene context sources perform better than models relying on a single context source. Moreover, ContextSortLM outperforms other models in placing objects to replicate the target user's arrangement and ranks among the top two in all three environment categories, as rated by online evaluators. Importantly, our evaluation highlights challenges associated with modeling environment semantics across different environment categories and provides recommendations for future work.

摘要

物体重排是家用机器人的关键任务,需要在不明确指令的情况下实现个性化、在已有物体的环境中进行有意义的位置摆放,并能泛化至未见过的物体和新环境。为促进相关研究,我们提出PARSEC基准测试——通过学习用户从场景上下文中体现的整理偏好,在部分已整理环境中摆放物体的任务基准。该基准基于从72名用户众包的11万条重排示例构建,涵盖93个物体类别和15种环境。我们同时提出ContextSortLM模型,这种基于大语言模型的重排系统能够通过适配用户从历史及当前场景上下文体现的偏好,在部分整理环境中摆放物体,并考虑多种有效摆放方案。我们在PARSEC基准上评估ContextSortLM及现有个性化重排方法,并辅以108名在线评估者对模型预测结果与用户偏好匹配度的排名众包实验。结果表明:利用多源场景上下文的个性化重排模型优于依赖单一上下文的模型;ContextSortLM在还原目标用户布局方面表现最优,并在在线评估者评定的全部三种环境类别中均位列前二。值得注意的是,评估揭示了不同环境类别间语义建模的挑战,并为未来研究提出了改进建议。


Scaling Reasoning can Improve Factuality in Large Language Models

Abstract

arXiv:2505.11140v1 Announce Type: cross Abstract: Recent studies on large language model (LLM) reasoning capabilities have demonstrated promising improvements in model performance by leveraging a lengthy thinking process and additional computational resources during inference, primarily in tasks involving mathematical reasoning (Muennighoff et al., 2025). However, it remains uncertain if longer reasoning chains inherently enhance factual accuracy, particularly beyond mathematical contexts. In this work, we thoroughly examine LLM reasoning within complex open-domain question-answering (QA) scenarios. We initially distill reasoning traces from advanced, large-scale reasoning models (QwQ-32B and DeepSeek-R1-671B), then fine-tune a variety of models ranging from smaller, instruction-tuned variants to larger architectures based on Qwen2.5. To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces. Our experimental setup includes four baseline approaches and six different instruction-tuned models evaluated across a benchmark of six datasets, encompassing over 22.6K questions. Overall, we carry out 168 experimental runs and analyze approximately 1.7 million reasoning traces. Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts. Moreover, our analysis demonstrates that with additional test-time compute and token budgets, factual accuracy consistently improves by 2-8%, further confirming the effectiveness of test-time scaling for enhancing performance and consequently improving reasoning accuracy in open-domain QA tasks. We release all the experimental artifacts for further research.

摘要

近期关于大语言模型(LLM)推理能力的研究表明,通过利用冗长的思维过程和在推理阶段增加计算资源(主要应用于数学推理任务),模型性能可显著提升(Muennighoff等人,2025)。然而,更长的推理链是否本质上能提高事实准确性——尤其是在数学领域之外的场景——仍存在疑问。本研究针对开放域复杂问答(QA)场景下的LLM推理机制展开系统探究:首先从先进的大规模推理模型(QwQ-32B与DeepSeek-R1-671B)中提炼推理轨迹,随后对从小型指令调优模型到基于Qwen2.5的大型架构等多种模型进行微调。为增强推理轨迹,我们以路径形式将知识图谱中的事实信息注入推理过程。实验设置包含四种基线方法和六种指令调优模型,在涵盖22.6K个问题的六个基准数据集上进行评估。总计完成168次实验运行,分析约170万条推理轨迹。研究发现:在单次运行中,小型推理模型相较原始指令调优版本能实现显著的事实准确性提升。此外,分析表明增加测试阶段计算量和token预算可使事实准确性持续提高2-8%,这进一步验证了测试阶段扩展策略对提升开放域QA任务性能及推理准确性的有效性。我们公开全部实验材料以供后续研究。


SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Abstract

arXiv:2505.11166v1 Announce Type: cross Abstract: Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

摘要

尽管预训练技术在扩展上下文长度方面取得了进展,大型语言模型(LLMs)在有效利用现实世界长上下文信息方面仍面临挑战,这主要源于数据质量问题、训练效率低下以及缺乏精心设计的优化目标所导致的长上下文对齐不足。为解决这些局限,我们提出了名为Short-to-Long Preference Optimization(SoLoPO)的框架,将长上下文偏好优化(PO)解耦为两个组成部分:短上下文PO和短到长奖励对齐(SoLo-RA),并得到理论和实证证据的支持。具体而言,短上下文PO利用从短上下文中采样的偏好对来增强模型的上下文知识利用能力;而SoLo-RA则显式地鼓励模型在包含相同任务相关信息的短上下文和长上下文条件下,对响应保持奖励分数的一致性利用。这有助于将模型处理短上下文的能力迁移至长上下文场景。SoLoPO兼容主流偏好优化算法,同时显著提升了数据构建和训练过程的效率。实验结果表明,SoLoPO在各种长上下文基准测试中增强了所有这些算法的长度和领域泛化能力,同时在计算和内存效率方面实现了显著提升。


RanDeS: Randomized Delta Superposition for Multi-Model Compression

Abstract

arXiv:2505.11204v1 Announce Type: cross Abstract: From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.

摘要

从多模型压缩的角度来看,模型合并能够高效地内存部署多个基于同一基础模型微调的模型,但由于其任务特定参数调整(即增量)之间的干扰,会导致性能下降。本文提出将模型合并重新表述为一种压缩-检索方案,揭示了任务干扰源于模型检索过程中无关增量的求和。为解决这一问题,我们采用随机正交变换将这些向量解相关为自抵消形式。研究表明,该方法能显著减少干扰,在视觉和语言任务上均提升了性能。由于这些变换完全由随机种子定义,新增模型无需额外内存。此外,其数据与模型无关的特性使得模型的添加或移除仅需极低计算开销,支持高效灵活的多模型服务。
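
The compress-and-retrieve idea can be illustrated with a small, dependency-free sketch. It uses a seeded signed permutation as the random orthogonal transformation. This is an assumption chosen to keep the example short, and the paper may use a different family of transforms. All function names are hypothetical.

```python
import random


def make_transform(seed: int, dim: int):
    """Seeded signed permutation: an orthogonal map defined only by its seed."""
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    return perm, signs


def transform(vec, seed):
    """Apply the transform: y[i] = signs[i] * x[perm[i]]."""
    perm, signs = make_transform(seed, len(vec))
    return [signs[i] * vec[perm[i]] for i in range(len(vec))]


def inverse_transform(vec, seed):
    """Undo the transform by regenerating it from the same seed."""
    perm, signs = make_transform(seed, len(vec))
    out = [0.0] * len(vec)
    for i in range(len(vec)):
        out[perm[i]] = signs[i] * vec[i]
    return out


def compress(deltas, seeds):
    """Superpose all transformed deltas into one vector of the same size."""
    dim = len(deltas[0])
    total = [0.0] * dim
    for delta, seed in zip(deltas, seeds):
        for i, v in enumerate(transform(delta, seed)):
            total[i] += v
    return total


def retrieve(superposed, seed):
    """Recover one model's delta plus randomized interference from the others."""
    return inverse_transform(superposed, seed)
```

In this sketch, retrieving with a model's own seed returns its delta exactly, plus the other deltas scrambled by mismatched transforms. Only the seeds need to be stored per model, which mirrors the abstract's point that adding models requires no extra memory.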


Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition

Abstract

arXiv:2505.11175v1 Announce Type: cross Abstract: Generative skill acquisition enables embodied agents to actively learn a scalable and evolving repertoire of control skills, crucial for the advancement of large decision models. While prior approaches often rely on supervision signals from generalist agents (e.g., LLMs), their effectiveness in complex 3D environments remains unclear; exhaustive evaluation incurs substantial computational costs, significantly hindering the efficiency of skill learning. Inspired by recent successes in verification models for mathematical reasoning, we propose VERGSA (Verifying Embodied Reasoning in Generative Skill Acquisition), a framework that systematically integrates real-time verification principles into embodied skill learning. VERGSA establishes 1) a seamless extension from verification of mathematical reasoning into embodied learning by dynamically incorporating contextually relevant tasks into prompts and defining success metrics for both subtasks and overall tasks, and 2) an automated, scalable reward labeling scheme that synthesizes dense reward signals by iteratively finalizing the contribution of scene configuration and subtask learning to overall skill acquisition. To the best of our knowledge, this approach constitutes the first comprehensive training dataset for verification-driven generative skill acquisition, eliminating arduous manual reward engineering. Experiments validate the efficacy of our approach: 1) the exemplar task pool improves the average task success rates by 21%, 2) our verification model boosts success rates by 24% for novel tasks and 36% for encountered tasks, and 3) outperforms LLM-as-a-Judge baselines in verification quality.

摘要

生成式技能获取使具身智能体能够主动学习可扩展且持续演化的控制技能库,这对大型决策模型的发展至关重要。现有方法通常依赖通用智能体(如大语言模型)的监督信号,但其在复杂3D环境中的有效性尚不明确;详尽评估会带来巨大计算成本,严重制约技能学习效率。受数学推理验证模型近期成果启发,我们提出VERGSA框架(生成式技能获取中的具身推理验证),将实时验证原则系统化融入具身技能学习。该框架实现两大创新:1)通过动态整合情境相关任务至提示词,并定义子任务与整体任务的成功指标,将数学推理验证无缝扩展至具身学习;2)自动化可扩展的奖励标注方案,通过迭代确定场景配置和子任务学习对整体技能获取的贡献度,合成密集奖励信号。据我们所知,这是首个验证驱动的生成式技能获取完整训练数据集,消除了繁琐的人工奖励工程。实验验证了方法的有效性:1)范例任务池使平均任务成功率提升21%,2)验证模型使新任务成功率提高24%、已遇任务提升36%,3)在验证质量上优于"大语言模型即评判"基线方法。


Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models

Abstract

arXiv:2505.11271v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.

摘要

大型语言模型(LLMs)正日益部署于边缘和云平台,用于实时问答与检索增强生成。然而,在分布式系统中处理长上下文会导致高昂的计算开销、内存占用和网络带宽消耗。本文提出一种新颖的语义缓存方法,通过存储和复用中间上下文摘要,实现基于LLM的问答工作流中相似查询的高效信息重用。实验表明,在NaturalQuestions、TriviaQA和合成的ArXiv数据集上,该方法在保持与完整文档处理相当的答案准确率的同时,将冗余计算减少50-60%。这种方案平衡了计算成本与响应质量,对实时AI助手至关重要。
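
A semantic cache of this kind can be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real sentence encoder. The similarity threshold, the class and function names, and the `summarize` callback are assumptions made for illustration and do not come from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding standing in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SummaryCache:
    """Stores contextual summaries keyed by the embedding of the query."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value; tune per workload
        self.entries = []  # list of (embedding, summary)

    def lookup(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query, summary):
        self.entries.append((embed(query), summary))


def get_summary(query, cache, summarize):
    """Return (summary, was_cache_hit); call the summarizer only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    summary = summarize(query)  # expensive step: full-context processing
    cache.store(query, summary)
    return summary, False
```

The saving comes from skipping the expensive `summarize` call whenever a sufficiently similar query has been seen. The threshold trades reuse rate against the risk of returning a summary that does not fit the new question.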


From Intent Discovery to Recognition with Topic Modeling and Synthetic Data

Abstract

arXiv:2505.11176v1 Announce Type: cross Abstract: Understanding and recognizing customer intents in AI systems is crucial, particularly in domains characterized by short utterances and the cold start problem, where recommender systems must include new products or services without sufficient real user data. Customer utterances are characterized by infrequent word co-occurrences and high term variability, which poses significant challenges for traditional methods in specifying distinct user needs and preparing synthetic queries. To address this, we propose an agentic LLM framework for topic modeling and synthetic query generation, which accelerates the discovery and recognition of customer intents. We first apply hierarchical topic modeling and intent discovery to expand a human-curated taxonomy from 36 generic user intents to 278 granular intents, demonstrating the potential of LLMs to significantly enhance topic specificity and diversity. Next, to support newly discovered intents and address the cold start problem, we generate synthetic user query data, which augments real utterances and reduces dependency on human annotation, especially in low-resource settings. Topic model experiments show substantial improvements in coherence and relevance after topic expansion, while synthetic data experiments indicate that in-class few-shot prompting significantly improves the quality and utility of synthetic queries without compromising diversity. We also show that LLM-generated intent descriptions and keywords can effectively substitute for human-curated versions when used as context for synthetic query generation. Our research underscores the scalability and utility of LLM agents in topic modeling and highlights the strategic use of synthetic utterances to enhance dataset variability and coverage for intent recognition. We present a comprehensive and robust framework for online discovery and recognition of new customer intents in dynamic domains.

摘要

理解并识别AI系统中的客户意图至关重要,尤其在具有短话语特征和冷启动问题的领域——这些场景下的推荐系统必须在缺乏真实用户数据的情况下纳入新产品或服务。客户话语呈现出低频词共现和高术语变异性的特点,这对传统方法在明确用户需求及生成合成查询方面构成重大挑战。为此,我们提出一个基于代理大语言模型(LLM)的主题建模与合成查询生成框架,以加速客户意图的发现与识别。我们首先应用层次化主题建模和意图发现技术,将人工整理的分类体系从36个通用用户意图扩展至278个细粒度意图,证明了大语言模型在显著提升主题特异性和多样性方面的潜力。其次,为支持新发现的意图并解决冷启动问题,我们生成合成用户查询数据以增强真实话语,减少对人类标注的依赖,尤其在低资源场景中。主题模型实验显示主题扩展后的一致性和相关性得到显著提升,而合成数据实验表明类内少样本提示能在保持多样性的同时显著提高合成查询的质量与实用性。我们还证明当作为合成查询生成的上下文时,LLM生成的意图描述和关键词可有效替代人工整理版本。本研究揭示了大语言模型代理在主题建模中的可扩展性与实用性,并阐明了利用合成话语增强意图识别数据集变异性和覆盖度的策略价值。我们提出了一套全面而鲁棒的框架,用于动态领域中新客户意图的在线发现与识别。


Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

Abstract

arXiv:2505.11277v1 Announce Type: cross Abstract: Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.

摘要

大语言模型展现出卓越的推理能力,但其性能本质上受限于知识储备。检索增强推理通过允许大语言模型查询外部资源来缓解这一局限,但现有方法常检索到无关或噪声信息,阻碍了准确推理。本文提出AutoRefine——一种采用新型“边思考边搜索优化”范式的强化学习微调框架。该框架通过在连续搜索调用间引入显式的知识精炼步骤,使模型能在生成答案前迭代地过滤、提纯和组织证据。此外,我们采用分组相对策略优化方法,将定制化的检索特异性奖励与答案正确性奖励相结合。在单跳和多跳问答基准测试上的实验表明,AutoRefine显著优于现有方法,尤其在复杂的多跳推理场景中。详细分析显示,AutoRefine能发起更频繁且更高质量的搜索,并有效合成证据。


Phare: A Safety Probe for Large Language Models

Abstract

arXiv:2505.11365v1 Announce Type: cross Abstract: Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.

摘要

确保大语言模型(LLMs)的安全性对于负责任部署至关重要,然而现有评估往往优先考虑性能而非识别故障模式。我们提出Phare这一多语言诊断框架,用于探测和评估LLMs在三个关键维度的行为:幻觉与可靠性、社会偏见以及有害内容生成。通过对17个最先进LLMs的评估,我们发现了所有安全维度上系统性漏洞的模式,包括谄媚行为、提示敏感性和刻板印象复现。通过重点揭示这些具体故障模式而非简单模型排名,Phare为研究者和实践者提供了可操作的见解,以构建更健壮、对齐且可信的语言系统。


Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

Abstract

arXiv:2505.11200v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS Systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS System evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset ATT-Corpus paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also finetune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).

摘要

大语言模型(LLM)的最新进展显著提升了文本转语音(TTS)系统的性能,增强了对语音风格、自然度及情感表达的控制,使TTS系统更接近人类水平。尽管平均意见得分(MOS)仍是TTS系统评估的标准方法,但其存在主观性、环境不一致性及可解释性有限等问题。现有评估数据集也缺乏多维设计,往往忽略说话风格、语境多样性和陷阱语句等因素,这在中文TTS评估中尤为明显。为解决这些问题,我们提出了音频图灵测试(ATT),这是一个多维中文语料数据集ATT-Corpus,并配套以简化的图灵测试启发式评估协议。ATT不依赖复杂的MOS量表或直接模型对比,而是要求评估者判断语音是否像人类发声。这种简化降低了评分偏差并提升了评估鲁棒性。为支持快速模型开发,我们还基于人类评判数据微调了Qwen2-Audio-Instruct模型作为自动评估工具Auto-ATT。实验结果表明,ATT通过其多维设计能有效区分模型在特定能力维度的表现。Auto-ATT亦展现出与人工评估的高度一致性,证实其作为快速可靠评估工具的价值。白盒化的ATT-Corpus与Auto-ATT可在ATT Hugging Face集合(https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4)中获取。


TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs

Abstract

arXiv:2505.11275v1 Announce Type: cross Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has significantly enhanced the ability of artificial intelligence systems to understand and generate multimodal content. However, these models often exhibit limited effectiveness when applied to non-Western cultural contexts, which raises concerns about their wider applicability. To address this limitation, we propose the Traditional Chinese Culture understanding Benchmark (TCC-Bench), a bilingual (i.e., Chinese and English) Visual Question Answering (VQA) benchmark specifically designed for assessing the understanding of traditional Chinese culture by MLLMs. TCC-Bench comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts. We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage. The benchmark also avoids language bias by preventing direct disclosure of cultural concepts within question texts. Experimental evaluations across a wide range of MLLMs demonstrate that current models still face significant challenges when reasoning about culturally grounded visual content. The results highlight the need for further research in developing culturally inclusive and context-aware multimodal systems. The code and data can be found at: https://github.com/Morty-Xu/TCC-Bench.

摘要

多模态大语言模型(MLLMs)的最新进展显著提升了人工智能系统理解和生成多模态内容的能力。然而,这些模型在非西方文化语境中的应用效果往往有限,这引发了对其广泛适用性的担忧。为解决这一局限性,我们提出了中国传统文化理解基准(TCC-Bench),这是一个专为评估MLLMs对中国传统文化理解能力而设计的双语(即中文和英文)视觉问答(VQA)基准。TCC-Bench包含文化内涵丰富且视觉多样化的数据,涵盖博物馆文物、日常生活场景、漫画及其他具有文化意义的图像。我们采用半自动化流程,利用纯文本模式的GPT-4o生成候选问题,再通过人工筛选确保数据质量并避免潜在的数据泄露。该基准还通过避免在问题文本中直接披露文化概念来消除语言偏差。对多种MLLMs的实验评估表明,当前模型在基于文化的视觉内容推理方面仍面临重大挑战。结果凸显了开发具有文化包容性和情境感知能力的多模态系统的必要性。代码与数据详见:https://github.com/Morty-Xu/TCC-Bench。


DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios

Abstract

arXiv:2505.11340v1 Announce Type: cross Abstract: Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering workflows through three key components: real-world function extraction (comprising 23,400 functions from 130 real-world programs), runtime-aware validation, and automated human-centric assessment using LLM-as-Judge to quantify the effectiveness of decompilers in reverse engineering workflows. Through a systematic comparison between six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functionality correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open source DecompileBench (https://github.com/Jennieett/DecompileBench) to provide a framework to advance research on decompilers and assist security experts in making informed tool selections based on their specific requirements.

摘要

反编译器是从漏洞发现到恶意软件分析等关键安全任务的基础工具,但其评估体系仍处于碎片化状态。现有方法主要通过合成微基准测试或主观人工评分来关注语法正确性,未能满足现实场景中对语义保真度和分析师可用性的需求。我们提出DecompileBench——首个通过三大核心组件实现逆向工程工作流中反编译器有效评估的综合框架:真实世界函数提取(包含来自130个真实程序的23,400个函数)、运行时感知验证,以及采用LLM-as-Judge的自动化人本评估,用以量化反编译器在逆向工程工作流中的效能。通过对六款工业级反编译器与六种最新LLM驱动方法的系统比较,我们发现基于LLM的方法在代码可理解性上超越商业工具,尽管其功能正确性低52.2%。这些发现揭示了LLM方法在变革人本逆向工程方面的潜力。我们开源DecompileBench,旨在为反编译器研究提供推进框架,并协助安全专家根据具体需求做出明智的工具选择。


Mergenetic: a Simple Evolutionary Model Merging Library

Abstract

arXiv:2505.11427v1 Announce Type: cross Abstract: Model merging allows combining the capabilities of existing models into a new one - post hoc, without additional training. This has made it increasingly popular thanks to its low cost and the availability of libraries that support merging on consumer GPUs. Recent work shows that pairing merging with evolutionary algorithms can boost performance, but no framework currently supports flexible experimentation with such strategies in language models. We introduce Mergenetic, an open-source library for evolutionary model merging. Mergenetic enables easy composition of merging methods and evolutionary algorithms while incorporating lightweight fitness estimators to reduce evaluation costs. We describe its design and demonstrate that Mergenetic produces competitive results across tasks and languages using modest hardware.

摘要

模型融合技术能够将现有模型的能力整合到一个新模型中——这种后处理方法无需额外训练。由于其低成本特性及支持消费级GPU运算的开源工具普及,该技术日益受到关注。近期研究表明,将融合技术与进化算法结合可显著提升性能,但目前尚无框架能支持语言模型领域对此类策略的灵活实验。我们推出Mergenetic这一进化式模型融合开源库,该工具可便捷组合多种融合方法与进化算法,并通过轻量级适应度评估器降低计算开销。本文阐述了其设计原理,并验证了Mergenetic在适度硬件条件下,能跨任务与跨语言生成具有竞争力的结果。


MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Abstract

arXiv:2505.11415v1 Announce Type: cross Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.

摘要

稀疏混合专家(MoE)架构因其能高效扩展大语言模型(LLMs)而日益受到青睐,但其依赖于异构的计算和内存资源。这些因素共同影响系统的成本、准确性和性能(CAP),使得权衡不可避免。现有基准测试往往无法准确捕捉这些权衡,从而增加了实际部署决策的复杂性。为解决这一问题,我们提出了MoE-CAP,这是一个专为MoE系统设计的基准测试。我们的分析表明,在当前硬件条件下,很难在CAP三者之间实现最优平衡;MoE系统通常只能优化其中两个维度,而牺牲第三个维度——我们将这种动态称为MoE-CAP权衡。为了直观展示这一点,我们提出了CAP雷达图。此外,我们还引入了稀疏感知性能指标——稀疏内存带宽利用率(S-MBU)和稀疏模型浮点运算利用率(S-MFU),以便在不同硬件平台和部署场景下对MoE系统进行准确的性能基准测试。


Visual Planning: Let's Think Only with Images

Abstract

arXiv:2505.11409v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

摘要

大语言模型(LLMs)及其多模态扩展(MLLMs)的最新进展显著提升了机器在多样化任务中的推理能力。然而,即使存在视觉信息,这些模型仍主要依赖纯文本作为表达和结构化推理的媒介。本研究提出,语言可能并非总是最自然或最有效的推理模态,尤其是在涉及空间和几何信息的任务中。基于此,我们提出了一种新范式——视觉规划(Visual Planning),通过纯视觉表示实现独立于文本的规划。该范式中,规划通过编码逐步视觉推理的图像序列执行,类似于人类通过草图或可视化未来动作进行思考的方式。我们引入了一种新颖的强化学习框架——基于强化学习的视觉规划(VPRL),利用GRPO对大型视觉模型进行后训练,在代表性视觉导航任务(FrozenLake、Maze和MiniBehavior)中实现了规划能力的显著提升。我们的视觉规划范式优于所有在纯文本空间中进行推理的规划变体。研究结果表明,视觉规划是替代基于语言推理的可行且有前景的新方向,为受益于直观图像推理的任务开辟了新途径。


EdgeWisePersona: A Dataset for On-Device User Profiling from Natural Language Interactions

Abstract

arXiv:2505.11417v1 Announce Type: cross Abstract: This paper introduces a novel dataset and evaluation benchmark designed to assess and improve small language models deployable on edge devices, with a focus on user profiling from multi-session natural language interactions in smart home environments. At the core of the dataset are structured user profiles, each defined by a set of routines - context-triggered, repeatable patterns of behavior that govern how users interact with their home systems. Using these profiles as input, a large language model (LLM) generates corresponding interaction sessions that simulate realistic, diverse, and context-aware dialogues between users and their devices. The primary task supported by this dataset is profile reconstruction: inferring user routines and preferences solely from interaction history. To assess how well current models can perform this task under realistic conditions, we benchmarked several state-of-the-art compact language models and compared their performance against large foundation models. Our results show that while small models demonstrate some capability in reconstructing profiles, they still fall significantly short of large models in accurately capturing user behavior. This performance gap poses a major challenge - particularly because on-device processing offers critical advantages, such as preserving user privacy, minimizing latency, and enabling personalized experiences without reliance on the cloud. By providing a realistic, structured testbed for developing and evaluating behavioral modeling under these constraints, our dataset represents a key step toward enabling intelligent, privacy-respecting AI systems that learn and adapt directly on user-owned devices.

摘要

本文介绍了一个新颖的数据集和评估基准,旨在评估和改进可部署于边缘设备的小型语言模型,重点关注智能家居环境中多会话自然语言交互的用户画像构建。该数据集的核心是结构化用户画像,每个画像由一组日常行为模式定义——这些由情境触发、可重复的行为模式决定了用户与家居系统的交互方式。基于这些画像输入,大型语言模型(LLM)生成相应的交互会话,模拟用户与设备之间真实、多样且情境感知的对话。

该数据集支持的主要任务是画像重建:仅通过交互历史推断用户行为模式和偏好。为评估现有模型在真实条件下的表现,我们对多种最先进的紧凑型语言模型进行了基准测试,并将其性能与大型基础模型进行对比。研究结果表明,虽然小型模型展现出一定的画像重建能力,但在准确捕捉用户行为方面仍显著落后于大型模型。这种性能差距构成重大挑战——尤其因为设备端处理具有关键优势,例如保护用户隐私、降低延迟,以及在不依赖云端的情况下实现个性化体验。通过为受限条件下的行为建模开发与评估提供真实、结构化的测试平台,我们的数据集朝着实现直接在用户设备上学习与适应的、尊重隐私的智能AI系统迈出了关键一步。


MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

Abstract

arXiv:2505.11432v1 Announce Type: cross Abstract: We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88× compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.

摘要

我们推出MegaScale-MoE——一个专为高效训练大规模混合专家(MoE)模型而设计的生产系统。MoE作为一种极具前景的架构,能够将大语言模型(LLMs)扩展到前所未有的规模,从而提升模型性能。然而现有MoE训练系统存在训练效率下降的问题,且随着MoE模型规模的扩大和硬件的持续迭代,该问题日益凸显。认识到高效通信对提升MoE训练的关键作用,MegaScale-MoE为每个MoE层中的注意力机制和前馈网络定制了通信高效的并行策略,并采用整体方法在算子间和算子内两个层面实现通信与计算的重叠。此外,系统通过调整通信模式实施低精度通信压缩,进一步提升训练效率。在1,440块NVIDIA Hopper GPU上训练352B参数的MoE模型时,MegaScale-MoE实现了1.41M tokens/s的训练吞吐量,相比Megatron-LM效率提升1.88倍。我们分享了加速MoE训练的实战经验,希望通过系统设计方面的洞见,推动未来MoE系统领域的研究发展。
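The headline figures in the MegaScale-MoE abstract can be turned into per-GPU numbers with simple arithmetic. The sketch below uses only the three values quoted above. The variable names are illustrative, and the implied baseline assumes the 1.88× figure is a throughput ratio measured on the same 1,440 GPUs, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract.
total_tokens_per_s = 1.41e6    # reported training throughput (tokens/s)
num_gpus = 1440                # NVIDIA Hopper GPUs used for the 352B model
speedup_vs_megatron = 1.88     # reported efficiency gain over Megatron-LM

# Average throughput contributed by each GPU.
per_gpu_tokens_per_s = total_tokens_per_s / num_gpus

# Assumption: the 1.88x gain is a throughput ratio on identical hardware,
# so the baseline throughput is the reported value divided by the ratio.
implied_baseline_tokens_per_s = total_tokens_per_s / speedup_vs_megatron

print(f"per-GPU throughput: {per_gpu_tokens_per_s:.0f} tokens/s")
print(f"implied baseline: {implied_baseline_tokens_per_s / 1e6:.2f}M tokens/s")
```

Under these assumptions each GPU processes roughly 979 tokens per second, and the implied Megatron-LM throughput is about 0.75M tokens/s.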


Modeling cognitive processes of natural reading with transformer-based Language Models

Abstract

arXiv:2505.11485v1 Announce Type: cross Abstract: Recent advances in Natural Language Processing (NLP) have led to the development of highly sophisticated language models for text generation. In parallel, neuroscience has increasingly employed these models to explore cognitive processes involved in language comprehension. Previous research has shown that models such as N-grams and LSTM networks can partially account for predictability effects in explaining eye movement behaviors, specifically Gaze Duration, during reading. In this study, we extend these findings by evaluating transformer-based models (GPT2, LLaMA-7B, and LLaMA2-7B) to further investigate this relationship. Our results indicate that these architectures outperform earlier models in explaining the variance in Gaze Durations recorded from Rioplatense Spanish readers. However, similar to previous studies, these models still fail to account for the entirety of the variance captured by human predictability. These findings suggest that, despite their advancements, state-of-the-art language models continue to predict language in ways that differ from human readers.

摘要

自然语言处理(NLP)的最新进展催生了高度复杂的文本生成语言模型。与此同时,神经科学领域越来越多地采用这些模型来探索语言理解涉及的认知过程。先前研究表明,N元语法和LSTM网络等模型能够部分解释阅读过程中眼动行为(特别是凝视时间)的可预测性效应。本研究通过评估基于Transformer的模型(GPT2、LLaMA-7B和LLaMA2-7B)扩展了这些发现,进一步探究这种关系。结果表明,在解释里奥普拉滕西西班牙语读者记录的凝视时间方差时,这些架构优于早期模型。但与既往研究类似,这些模型仍无法完全解释人类可预测性所捕获的全部方差。这些发现表明,尽管取得了进步,最先进的语言模型在预测语言时仍与人类读者的方式存在差异。


Disentangling Reasoning and Knowledge in Medical Large Language Models

Abstract

arXiv:2505.11462v1 Announce Type: cross Abstract: Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.

摘要

大语言模型(LLMs)的医学推理旨在模拟临床医生的诊断思维,但当前基准测试如MedQA-USMLE、MedMCQA和PubMedQA常将推理与事实记忆混为一谈。我们通过PubMedBERT分类器(准确率达81%,与人类表现相当)将11个生物医学QA基准划分为推理主导和知识主导的子集来解决这一问题。分析表明,仅32.8%的问题需要复杂推理。我们评估了生物医学模型(华佗GPT-o1、MedReason、m1)和通用领域模型(DeepSeek-R1、o4-mini、Qwen3),发现知识与推理表现间存在持续差距。例如m1在知识项得分60.5,而推理项仅47.1。在误导性初始推理的对抗测试中,生物医学模型性能急剧下降,而更大规模或经强化学习的通用模型展现出更强鲁棒性。为此,我们通过在推理密集型样本上进行微调和强化学习训练出BioMed-R1模型,其在同等规模模型中表现最优。未来提升或可通过整合临床病例报告,以及采用对抗训练和回溯场景训练实现。


GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

Abstract

arXiv:2505.11436v1 Announce Type: cross Abstract: Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at https://github.com/stan-lei/GODBench-ACL2025.

摘要

视频评论艺术通过提供传递幽默、讽刺或情感共鸣的创意内容来增强用户参与度,这需要对文化和语境细微差别有细致全面的把握。尽管多模态大语言模型(MLLMs)和思维链(CoT)在STEM任务(如数学和编程)中展现出强大的推理能力,但在生成共鸣笑话和深刻讽刺等创造性表达方面仍存在困难。此外,现有基准测试受限于模态单一和类别不足,阻碍了视频评论艺术创作中综合创造力的探索。为解决这些局限,我们提出GODBench——一个融合视频与文本模态的新型基准,用于系统评估MLLMs创作评论艺术的能力。受物理学中波的传播模式启发,我们进一步提出"思维涟漪"(RoT)多步推理框架以增强MLLMs的创造力。大量实验表明,现有MLLMs和CoT方法在理解和生成创意视频评论方面仍面临重大挑战,而RoT为提升创意创作提供了有效途径,凸显其推动基于MLLM的创造力取得实质性进展的潜力。GODBench已公开于https://github.com/stan-lei/GODBench-ACL2025。


LLMs unlock new paths to monetizing exploits

Abstract

arXiv:2505.11449v1 Announce Type: cross Abstract: We argue that large language models (LLMs) will soon alter the economics of cyberattacks. Instead of attacking the most commonly used software and monetizing exploits by targeting the lowest common denominator among victims, LLMs enable adversaries to launch tailored attacks on a user-by-user basis. On the exploitation front, instead of human attackers manually searching for one difficult-to-identify bug in a product with millions of users, LLMs can find thousands of easy-to-identify bugs in products with thousands of users. And on the monetization front, instead of generic ransomware that always performs the same attack (encrypt all your data and request payment to decrypt), an LLM-driven ransomware attack could tailor the ransom demand based on the particular content of each exploited device. We show that these two attacks (and several others) are imminently practical using state-of-the-art LLMs. For example, we show that without any human intervention, an LLM finds highly sensitive personal information in the Enron email dataset (e.g., an executive having an affair with another employee) that could be used for blackmail. While some of our attacks are still too expensive to scale widely today, the incentives to implement these attacks will only increase as LLMs get cheaper. Thus, we argue that LLMs create a need for new defense-in-depth approaches.

摘要

我们提出,大型语言模型(LLMs)将很快改变网络攻击的经济模式。传统攻击通常针对最常用软件,并通过锁定受害者的最大共性来实现漏洞货币化,而LLMs使攻击者能够针对每个用户实施定制化攻击。在漏洞利用方面,传统方式需要攻击者人工寻找拥有数百万用户的产品中难以发现的单一漏洞,而LLMs可以在拥有数千用户的产品中发现成千上万个易于识别的漏洞。在变现方面,传统通用勒索软件总是执行相同攻击(加密所有数据并要求支付解密费用),而LLM驱动的勒索攻击则能根据每台被入侵设备的特定内容定制赎金要求。我们证明这两种攻击(以及其他几种)利用最先进的LLMs即可立即实施。例如,研究表明在没有人工干预的情况下,LLMs能从安然公司邮件数据集中发现可用于勒索的高度敏感个人信息(如高管与员工间的婚外情)。虽然目前部分攻击因成本过高难以大规模实施,但随着LLMs成本下降,实施这些攻击的动机只会不断增强。因此我们认为,LLMs的出现迫切需要建立新的纵深防御体系。


HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Abstract

arXiv:2505.11475v1 Announce Type: cross Abstract: Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate that HelpSteer3-Preference can also be applied to train Generative RMs and show how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference

摘要

偏好数据集对于通过人类反馈强化学习(RLHF)训练通用领域、遵循指令的语言模型至关重要。每次后续数据发布都会提高对未来数据收集的期望,这意味着需要不断提升公开可用偏好数据的质量和多样性。为满足这一需求,我们推出了HelpSteer3-Preference,这是一个采用宽松许可协议(CC-BY-4.0)、高质量、人工标注的偏好数据集,包含超过40,000个样本。这些样本涵盖大语言模型(LLM)的多样化实际应用场景,包括与STEM、编程及多语言相关的任务。利用HelpSteer3-Preference,我们训练的奖励模型(RM)在RM-Bench(82.4%)和JudgeBench(73.7%)上取得了最优性能,相较于现有奖励模型的最佳报告结果实现了显著提升(约10%绝对改进)。我们还展示了HelpSteer3-Preference可用于训练生成式奖励模型,以及如何利用我们的奖励模型通过RLHF对齐策略模型。数据集(CC-BY-4.0):https://huggingface.co/datasets/nvidia/HelpSteer3#preference


Improving Assembly Code Performance with Large Language Models via Reinforcement Learning

Abstract

arXiv:2505.11480v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated strong performance across a wide range of programming tasks, yet their potential for code optimization remains underexplored. This work investigates whether LLMs can optimize the performance of assembly code, where fine-grained control over execution enables improvements that are difficult to express in high-level languages. We present a reinforcement learning framework that trains LLMs using Proximal Policy Optimization (PPO), guided by a reward function that considers both functional correctness, validated through test cases, and execution performance relative to the industry-standard compiler gcc -O3. To support this study, we introduce a benchmark of 8,072 real-world programs. Our model, Qwen2.5-Coder-7B-PPO, achieves 96.0% test pass rates and an average speedup of 1.47x over the gcc -O3 baseline, outperforming all 20 other models evaluated, including Claude-3.7-sonnet. These results indicate that reinforcement learning can unlock the potential of LLMs to serve as effective optimizers for assembly code performance.

摘要

大语言模型(LLMs)在广泛编程任务中展现出卓越性能,但其代码优化潜力仍未充分探索。本研究探讨LLMs能否优化汇编代码性能——在汇编层面,精细的执行控制可实现高级语言难以表达的改进。我们提出一个强化学习框架,通过近端策略优化(PPO)训练LLMs,其奖励函数同时考虑测试用例验证的功能正确性,以及与行业标准编译器gcc -O3相比的执行性能。为此研究,我们构建了包含8,072个真实程序的基准集。实验表明,我们的模型Qwen2.5-Coder-7B-PPO实现了96.0%的测试通过率,平均加速比达gcc -O3基线的1.47倍,优于包括Claude-3.7-sonnet在内的全部20个对比模型。这些结果表明强化学习能释放LLMs作为汇编代码性能优化器的潜力。
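The abstract above describes a reward that combines functional correctness (checked through test cases) with execution performance relative to gcc -O3, but it does not give the formula. The following is a minimal sketch of one plausible shape for such a reward, not the paper's actual definition: the function name `reward`, its parameters, and the rule that any failing test yields zero reward are all assumptions made for illustration.

```python
def reward(tests_passed: int, tests_total: int,
           baseline_seconds: float, candidate_seconds: float) -> float:
    """Hypothetical reward: zero unless all tests pass, otherwise the speedup.

    baseline_seconds is the runtime of the gcc -O3 build and
    candidate_seconds is the runtime of the model-generated assembly.
    """
    if tests_total <= 0 or baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("test count and timings must be positive")
    if tests_passed < tests_total:
        # Assumption: an incorrect program earns no performance credit.
        return 0.0
    # Speedup over the baseline; values above 1.0 mean faster than gcc -O3.
    return baseline_seconds / candidate_seconds


# A correct program that runs in 1.0 s against a 1.47 s baseline.
print(reward(10, 10, baseline_seconds=1.47, candidate_seconds=1.0))
# A faster program that fails one test.
print(reward(9, 10, baseline_seconds=1.47, candidate_seconds=0.5))
```

Gating the speedup term on correctness keeps a policy from being rewarded for fast but wrong code. Other designs, such as partial credit per passing test, are equally plausible readings of the abstract.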


A Novel Mathematical Framework for Objective Characterization of Ideas

Abstract

arXiv:2409.07578v3 Announce Type: replace Abstract: The demand for innovation in product design necessitates a prolific ideation phase. Conversational AI (CAI) systems that use Large Language Models (LLMs) such as GPT (Generative Pre-trained Transformer) have been shown to be fruitful in augmenting human creativity, providing numerous novel and diverse ideas. Despite the success in ideation quantity, the qualitative assessment of these ideas remains challenging and traditionally reliant on expert human evaluation. This method suffers from limitations such as human judgment errors, bias, and oversight. Addressing this gap, our study introduces a comprehensive mathematical framework for automated analysis to objectively evaluate the plethora of ideas generated by CAI systems and/or humans. This framework is particularly advantageous for novice designers who lack experience in selecting promising ideas. By converting the ideas into higher dimensional vectors and quantitatively measuring the diversity between them using tools such as UMAP, DBSCAN and PCA, the proposed method provides a reliable and objective way of selecting the most promising ideas, thereby enhancing the efficiency of the ideation phase.

摘要

对产品设计创新的需求要求一个高产的构思阶段。基于大型语言模型(如生成式预训练变换器GPT)的对话式人工智能系统已被证明能有效增强人类创造力,提供大量新颖且多样化的创意。尽管在构思数量上取得成功,但对这些创意的质量评估仍具挑战性,传统上依赖专家人工评价。该方法存在人类判断误差、偏见和疏漏等局限性。针对这一缺口,本研究提出一个用于自动化分析的综合性数学框架,以客观评估对话式人工智能系统和/或人类生成的大量创意。该框架对缺乏筛选潜力创意经验的新手设计师尤为有利。通过将创意转化为高维向量,并利用UMAP、DBSCAN和PCA等工具定量测量创意间的多样性,所提方法为筛选最具潜力的创意提供了可靠且客观的途径,从而提升了构思阶段的效率。


TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction

Abstract

arXiv:2410.11507v4 Announce Type: replace Abstract: As large language models (LLMs) are increasingly deployed to various vertical domains, automatically evaluating their performance across different domains remains a critical challenge. Current evaluation methods often rely on static and resource-intensive datasets that are not aligned with real-world requirements and lack cross-domain adaptability. To address these limitations, we revisit the evaluation process and introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process to facilitate deeper exploration and comprehensive analysis from multiple perspectives. We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning. TestAgent enables automatic dynamic benchmark generation and in-depth assessment across diverse vertical domains. Experiments on tasks ranging from constructing multiple vertical domain evaluations to transforming static benchmarks into dynamic forms demonstrate the effectiveness of TestAgent. This work provides a novel perspective on automatic evaluation methods for domain-specific LLMs, offering a pathway for domain-adaptive dynamic benchmark construction and exploratory assessment.

摘要

随着大语言模型(LLMs)在垂直领域的广泛应用,如何自动评估其跨领域性能仍是一项关键挑战。现有评估方法通常依赖于静态且资源密集的数据集,这些数据集既不符合实际需求,也缺乏跨领域适应性。为解决这些局限,我们重新审视评估流程并引入两个核心概念:基准+(将传统问答基准扩展为更灵活的"策略-标准"格式)和评估+(通过增强交互过程实现多视角深度探索与综合分析)。我们提出基于智能体的评估框架TestAgent,该框架利用检索增强生成和强化学习技术实现上述概念,支持跨垂直领域的自动动态基准生成与深度评估。在从多垂直领域评估构建到静态基准动态化转换等任务上的实验验证了TestAgent的有效性。本研究为领域专用大语言模型的自动评估方法提供了新思路,实现了领域自适应动态基准构建与探索式评估的技术路径。


TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

Abstract

arXiv:2502.01506v3 Announce Type: replace Abstract: The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.

摘要

社会涌现研究长期以来一直是社会科学的核心焦点。传统建模方法(如基于规则的多主体模型)难以捕捉人类行为的多样性和复杂性,尤其是行为经济学强调的非理性因素。近年来,大型语言模型智能体作为模拟人类行为的工具,在社会科学和角色扮演应用中日益受到关注。研究表明,大型语言模型能够解释认知偏差、情绪波动等非理性影响,从而实现对社会经济动态更真实的模拟。本研究提出TwinMarket——一个利用大型语言模型模拟社会经济系统的新型多主体框架。我们特别关注个体行为如何通过交互与反馈机制产生集体动态和涌现现象。通过在模拟股票市场环境中的实验,我们展示了个人行为如何引发群体行为,并导致金融泡沫和经济衰退等涌现结果。该方法为理解个体决策与集体社会经济模式之间的复杂相互作用提供了重要见解。


Parallel Market Environments for FinRL Contests

Abstract

arXiv:2504.02281v3 Announce Type: replace Abstract: Financial reinforcement learning (FinRL) has emerged as a promising paradigm for sequential decision-making in financial engineering. However, applying RL in real-world trading tasks remains challenging due to the non-stationarity of financial data, low signal-to-noise ratios, and various market frictions. Although numerous FinRL methods have been developed for tasks such as trading and portfolio management, the lack of standardized task definitions, datasets, environments, and baselines has hindered consistent evaluation and reproducibility. To bridge this gap, we organized three FinRL Contests from 2023 to 2025, covering a diverse range of financial tasks such as stock trading, order execution, cryptocurrency trading, and the use of large language model (LLM)-generated signals. These contests attracted 200 participants from over 100 institutions across 22 countries. To promote reproduction, we provided open-source starter kits featuring GPU-optimized parallel market environments and comprehensive documentation. In this paper, we summarize these benchmarking efforts, detailing task formulations, data curation pipelines, environment implementations, evaluation protocols, participant performance, and key organizational insights.


MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Abstract

arXiv:2411.17720v2 Announce Type: replace Abstract: The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision and other domains. The attention mechanism has become an essential component of foundation models, due to its superb capability of capturing correlations in a sequence. However, attention results in quadratic complexity in memory and compute as the context length grows. Although many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators leveraging multi-core parallelism and data locality, it remains a significant challenge to accelerate attention on resource-constrained edge neural accelerators with limited compute units and stringent on-chip caches. In this paper, we propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators, by parallelizing the utilization of heterogeneous compute units, i.e., vector processing units and matrix processing units. Our method involves scheduling workloads onto these different compute units in a multi-tiered tiling scheme to process tiled vector workloads and matrix workloads in attention as two streams, respecting the workload dependencies. We search for tiling factors to maximize the parallelization of both compute units while considering I/O overhead, and propose a proactive cache overwrite strategy to avoid undesirable cache spills in reality. Extensive results based on open-sourced simulation frameworks show up to 2.75x speedup and 54% reduction in energy consumption as compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario. Further experiments on a real-world edge neural processing unit demonstrate speedup of up to 1.76x for attention as compared to FLAT, without affecting model output accuracy.

摘要

基础模型的出现彻底改变了多个领域,在计算语言学、计算机视觉等学科实现了前所未有的任务精度与灵活性。注意力机制凭借其卓越的序列关联捕捉能力,已成为基础模型的核心组件。然而随着上下文长度增加,注意力机制会导致内存与计算量呈二次方增长。尽管目前已开发出诸多基于融合的精确注意力加速算法,可利用多核并行与数据局部性适配数据中心级GPU和加速器,但在计算单元有限、片上缓存严格的资源受限边缘神经加速器上加速注意力仍面临重大挑战。本文提出一种面向内存受限边缘加速器的精确注意力推理加速方案,通过并行化异构计算单元(向量处理单元与矩阵处理单元)的协同使用。我们的方法采用多层分块调度策略,将注意力计算中的分块向量负载与矩阵负载作为两个流进行差异化调度,同时遵循计算依赖关系。通过搜索最优分块因子以实现计算单元并行化最大化,并考虑I/O开销,进而提出主动缓存覆写策略以避免实际运行中的非预期缓存溢出。基于开源仿真框架的大量实验表明,在边缘计算场景下,相较最先进的注意力融合方法(FLAT),本方案最高可实现2.75倍加速与54%能耗降低。在实际边缘神经处理单元上的进一步实验证明,在保证模型输出精度的前提下,注意力计算速度较FLAT最高提升1.76倍。


MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?

Abstract

arXiv:2502.09933v4 Announce Type: replace Abstract: The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition that asks an LLM to predict outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning and obtain many insightful findings, including scaling effects, robustness, inductive vs. transductive reasoning, Retrieval-Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.

摘要

通过示例识别模式并将其应用于新情境的能力是通用智能的核心能力,这一直是心理学和人工智能研究的重点领域。虽然已有多个基准测试被提出用于评估大语言模型(LLMs)的此类能力,但这些测试主要关注少样本(通常<10)设置,且缺乏对从长上下文中整合多源信息能力的评估。另一方面,LLMs不断增长的上下文长度催生了多样本上下文学习(ICL)的新范式,该范式可通过数百至数千个示例解决新任务,而无需昂贵低效的微调。然而当前多样本评估多集中于分类任务,而主流的长上下文LLM任务(如"大海捞针")很少需要整合多源信息的复杂智能。为弥补这两方面的不足,我们提出MIR-Bench——首个面向模式识别的多样本上下文推理基准,要求LLM通过不同数据格式的基础函数输入输出示例进行预测。基于该基准,我们研究了多样本上下文推理中的诸多新问题,获得了包括规模效应、鲁棒性、归纳与转导推理、检索增强生成(RAG)、归纳推理编码、跨领域泛化等富有洞见的发现。
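The MIR-Bench task format (predict an output from many input-output examples of an underlying function) can be illustrated with a small prompt builder. This is a hedged sketch of the general idea only: the hidden function, the `Input:`/`Output:` template, and the helper names `make_shots` and `build_prompt` are hypothetical and are not taken from the benchmark.

```python
from typing import Any, Callable, List, Tuple

Shot = Tuple[Any, Any]


def make_shots(fn: Callable[[Any], Any], inputs: List[Any]) -> List[Shot]:
    """Pair each input with the hidden function's output."""
    return [(x, fn(x)) for x in inputs]


def build_prompt(shots: List[Shot], query: Any) -> str:
    """Render all shots, then the query with its output left blank."""
    blocks = [f"Input: {x!r}\nOutput: {y!r}" for x, y in shots]
    blocks.append(f"Input: {query!r}\nOutput:")
    return "\n\n".join(blocks)


def hidden(xs: List[int]) -> int:
    """Hypothetical underlying function: second-largest element of a list."""
    return sorted(xs)[-2]


# 200 deterministic three-element inputs stand in for a many-shot context.
inputs = [[i, (i * 3) % 7, i + 2] for i in range(200)]
shots = make_shots(hidden, inputs)
prompt = build_prompt(shots, [4, 9, 1])
print(prompt[:80])
```

The model never sees `hidden` itself. It has to infer the rule from the 200 rendered pairs and complete the final `Output:` line, which is what makes the many-shot setting an aggregation problem and not a lookup.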


EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models

Abstract

arXiv:2503.14162v2 Announce Type: replace Abstract: Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. Furthermore, the application of large multi-modal models in IAD remains in its infancy, facing challenges in balancing question-answering (QA) performance and mask-based grounding capabilities, often owing to overfitting during the fine-tuning process. To address these challenges, we propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. This decoupling is achieved through independent optimization objectives and tailored learning strategies. Additionally, we contribute the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA), encompassing a wide range of defect types and industrial scenarios. Unlike conventional datasets that rely on GPT-generated data, DDQA ensures authenticity and reliability and offers a robust foundation for model training. Experimental results demonstrate that our proposed method, Explainable Industrial Anomaly Detection Assistant (EIAD), achieves outstanding performance in defect detection and localization tasks. It not only significantly enhances accuracy but also improves interpretability. These advancements highlight the potential of EIAD for practical applications in industrial settings.

摘要

工业异常检测(IAD)对保障制造业产品质量至关重要。尽管现有的零样本缺陷分割与检测方法已展现出有效性,但它们无法提供缺陷的详细描述。此外,多模态大模型在IAD领域的应用仍处于起步阶段,由于微调过程中的过拟合问题,其在问答(QA)性能与基于掩码的定位能力之间难以取得平衡。为解决这些挑战,我们提出一种创新方法,通过引入专用的多模态缺陷定位模块,将对话功能与核心特征提取解耦。这种解耦通过独立的优化目标和定制化学习策略实现。此外,我们构建了首个多模态工业异常检测训练数据集——缺陷检测问答数据集(DDQA),涵盖广泛的缺陷类型和工业场景。与依赖GPT生成数据的传统数据集不同,DDQA确保了数据的真实性与可靠性,为模型训练提供了坚实基础。实验结果表明,我们提出的可解释工业异常检测助手(EIAD)在缺陷检测与定位任务中表现卓越,不仅显著提升了准确率,还增强了可解释性。这些进展凸显了EIAD在工业场景中的实际应用潜力。


AI Idea Bench 2025: AI Research Idea Generation Benchmark

Abstract

arXiv:2504.14191v2 Announce Type: replace Abstract: Large-scale Language Models (LLMs) have revolutionized human-AI interaction and achieved significant success in the generation of novel ideas. However, current assessments of idea generation overlook crucial factors such as knowledge leakage in LLMs, the absence of open-ended benchmarks with ground truth, and the limited scope of feasibility analysis constrained by prompt design. These limitations hinder the potential of uncovering groundbreaking research ideas. In this paper, we present AI Idea Bench 2025, a framework designed to quantitatively evaluate and compare the ideas generated by LLMs within the domain of AI research from diverse perspectives. The framework comprises a comprehensive dataset of 3,495 AI papers and their associated inspired works, along with a robust evaluation methodology. This evaluation system gauges idea quality in two dimensions: alignment with the ground-truth content of the original papers and judgment based on general reference material. AI Idea Bench 2025's benchmarking system stands to be an invaluable resource for assessing and comparing idea-generation techniques, thereby facilitating the automation of scientific discovery.

摘要

大规模语言模型(LLMs)彻底改变了人机交互方式,并在创新思想生成领域取得显著成就。然而,当前对思想生成的评估存在重要缺陷:包括忽视LLMs中的知识泄露问题、缺乏基于真实数据的开放式基准测试,以及受限于提示设计而导致的可行性分析范围狭窄。这些限制阻碍了突破性研究思想的发掘潜力。本文提出"AI Idea Bench 2025"框架,旨在从多维度定量评估和比较LLMs在人工智能研究领域生成的思想。该框架包含3,495篇AI论文及其衍生研究的完整数据集,以及一套严谨的评估方法。该评估体系通过两个维度衡量思想质量:与原始论文真实内容的契合度,以及基于通用参考文献的判断。"AI Idea Bench 2025"的基准测试系统将成为评估和比较思想生成技术的宝贵资源,从而推动科学发现自动化进程。


Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis

Abstract

arXiv:2502.11164v5 Announce Type: replace Abstract: DeepSeek-R1, known for its low training cost and exceptional reasoning capabilities, has achieved state-of-the-art performance on various benchmarks. However, detailed evaluations of DeepSeek series models from the perspective of real-world applications are lacking, making it challenging for users to select the most suitable DeepSeek models for their specific needs. To address this gap, we present the first comprehensive evaluation of DeepSeek and its related models (including DeepSeek-V3, DeepSeek-R1, DeepSeek-R1-Distill-Qwen series, DeepSeek-R1-Distill-Llama series, their corresponding 4-bit quantized models, and the reasoning model QwQ-32B) using our enhanced A-Eval benchmark, A-Eval-2.0. Our systematic analysis reveals several key insights: (1) Given identical model architectures and training data, larger parameter models demonstrate superior performance, aligning with the scaling law. However, smaller models may achieve enhanced capabilities when employing optimized training strategies and higher-quality data; (2) Reasoning-enhanced models show significant performance gains in logical reasoning tasks but may underperform in text understanding and generation tasks; (3) As the data difficulty increases, distillation or reasoning enhancements yield higher performance gains for the models. Interestingly, reasoning enhancements can even have a negative impact on simpler problems; (4) Quantization impacts different capabilities unevenly, with a significant drop in logical reasoning and minimal impact on text generation. Based on these results and findings, we design a model selection handbook that enables users to select the most cost-effective models with little effort.

摘要

DeepSeek-R1以其低训练成本和卓越的推理能力著称,已在多项基准测试中取得领先性能。然而,目前缺乏从实际应用角度对DeepSeek系列模型的详细评估,这导致用户难以根据具体需求选择最合适的模型。为填补这一空白,我们首次采用升级版A-Eval-2.0基准,对DeepSeek及其相关模型(包括DeepSeek-V3、DeepSeek-R1、DeepSeek-R1-Distill-Qwen系列、DeepSeek-R1-Distill-Llama系列、对应的4比特量化模型及推理模型QwQ-32B)进行全面评估。系统分析得出以下关键结论:(1)在相同模型架构与训练数据条件下,大参数模型表现更优,符合缩放定律,但小模型通过优化训练策略和更高质量数据可提升能力;(2)推理增强模型在逻辑推理任务中表现显著提升,但在文本理解与生成任务中可能欠佳;(3)随着数据难度增加,蒸馏或推理增强带来的性能增益更高。值得注意的是,推理增强对简单问题可能产生负面影响;(4)量化对不同能力的影响不均衡,逻辑推理性能下降显著,而文本生成几乎不受影响。基于这些发现,我们设计了模型选择手册,帮助用户高效选择最具成本效益的模型。


SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning

Abstract

arXiv:2503.04530v3 Announce Type: replace Abstract: Large Language Models excel in reasoning yet often rely on Chain-of-Thought prompts, limiting performance on tasks demanding more nuanced topological structures. We present SOLAR (Scalable Optimization of Large-scale Architecture for Reasoning), a framework that dynamically optimizes Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT) topologies to boost accuracy and efficiency. Our Topological-Annotation-Generation (TAG) system automates dataset creation, annotation, and difficulty segmentation, leading to stronger post-training and test-time performance. We also propose Topological-Scaling, a curriculum-learning-based approach that adaptively combines post-training and inference scaling for each task. On MATH and GSM8K, SOLAR delivers notable gains: +5% accuracy with Topological Tuning, +9% with Topological Rewarding, and +10.02% with Hybrid Scaling, while reducing response length by over 5%, lowering inference latency. To further enhance efficiency, we introduce a multi-task Topological Reward Model (M-TRM) that selects both the optimal reasoning topology and final answer in a single pass, eliminating multiple single-task TRMs. Remarkably, M-TRM also surpasses all single-task TRMs, improving accuracy by +10% and rank correlation by +9%. Overall, SOLAR establishes a new benchmark for scalable, high-precision LLM reasoning and introduces a fully automated, dynamic topology competition mechanism.

摘要

大语言模型在推理任务中表现卓越,但通常依赖思维链(Chain-of-Thought)提示,这限制了其在需要更精细拓扑结构任务上的性能。我们提出了SOLAR(可扩展的大规模推理架构优化框架),该框架能动态优化思维链(CoT)、思维树(ToT)和思维图(GoT)拓扑结构,以提升准确性和效率。我们的拓扑标注生成(TAG)系统实现了数据集创建、标注和难度分段的自动化,从而增强了训练后和测试时的性能。我们还提出了拓扑缩放(Topological-Scaling),这是一种基于课程学习的方法,能自适应地将训练后缩放和推理缩放结合到每个任务中。在MATH和GSM8K数据集上,SOLAR取得了显著提升:通过拓扑调优(Topological Tuning)准确率提升5%,通过拓扑奖励(Topological Rewarding)提升9%,通过混合缩放(Hybrid Scaling)提升10.02%,同时将响应长度缩短超过5%,并降低了推理延迟。为了进一步提高效率,我们引入了多任务拓扑奖励模型(M-TRM),该模型能一次性选择最优推理拓扑和最终答案,从而消除了多个单任务TRM的需求。值得注意的是,M-TRM还超越了所有单任务TRM,将准确率提升了10%,排名相关性提升了9%。总体而言,SOLAR为可扩展、高精度的大语言模型推理设立了新基准,并引入了一种全自动的动态拓扑竞争机制。


Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Abstract

arXiv:2504.04072v2 Announce Type: replace Abstract: Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce Among Us, a sandbox social deception game where LLM-agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, Among Us can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate 18 proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of "pretend you're a dishonest model: ..." generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.

摘要

先前关于基于语言的人工智能代理欺骗行为的研究,通常仅评估代理是否针对某个主题生成虚假陈述,或在目标驱动下做出二元选择,而非观察其在追求长期目标时自然产生的开放式欺骗行为。为解决这一局限,我们引入《Among Us》沙盒社交欺骗游戏,该环境使LLM智能体能够因游戏目标而展现出长期、开放式的欺骗行为。与传统基准测试快速饱和不同,《Among Us》作为远离平衡态的多玩家游戏,可持续更长时间。通过该沙盒实验,我们评估了18个专有和开源权重的LLM模型,发现一个普遍趋势:经过强化学习训练的模型在生成欺骗内容方面显著优于欺骗检测能力。我们评估了两种欺骗检测方法的有效性:基于激活值的逻辑回归和稀疏自编码器(SAEs)。研究发现,在"假设你是不诚实模型:..."数据集上训练的探测模型表现出极强的跨分布泛化能力,即使仅评估欺骗性陈述(不含思维链),其AUROC值始终超过95%。同时发现两个SAE特征虽能有效检测欺骗,但无法引导模型减少谎言。我们期待开源沙盒环境、游戏日志及探测方法能够助力预测和缓解基于语言的智能体欺骗行为与能力。
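The abstract reports probe quality as AUROC. As a reminder of what that number measures, here is a minimal sketch that computes AUROC directly from its pairwise definition: the probability that a randomly chosen deceptive example gets a higher probe score than a randomly chosen honest one. The function name `auroc` and the toy scores are illustrative assumptions, not part of the paper's released code.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score against every negative score.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probe outputs: label 1 = deceptive statement, 0 = honest statement.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(auroc(toy_scores, toy_labels))
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value and scales better to large evaluation sets.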


Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Abstract

arXiv:2504.13837v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.

摘要

可验证奖励的强化学习(RLVR)近期在提升大语言模型(LLMs)的推理性能方面取得显著成功,尤其在数学和编程任务上。与传统强化学习帮助智能体探索新策略类似,RLVR被认为能使LLMs持续自我改进,从而获得超越基础模型的新型推理能力。本研究通过系统探究RLVR训练的LLMs在不同模型系列、强化学习算法以及数学、编程和视觉推理基准测试中的推理能力边界(采用大k值下的pass@k作为评估指标),对RLVR现状进行了批判性检验。令人惊讶的是,我们发现当前训练设置并未引发根本性的新推理模式。虽然RLVR训练模型在小k值(如k=1)下表现优于基础模型,但当k值较大时,基础模型的pass@k分数更高。覆盖率和困惑度分析表明,观察到的推理能力源自并受限于基础模型。将基础模型视为上限时,定量分析显示六种主流RLVR算法表现相似,且远未充分挖掘基础模型的潜力。相比之下,蒸馏方法能够从教师模型引入新推理模式,真正扩展模型的推理能力。总体而言,我们的研究结果表明,当前RLVR方法尚未实现强化学习激发LLMs真正新颖推理能力的潜力。这凸显了需要改进的强化学习范式(如持续扩展和多轮智能体-环境交互)以实现这一潜力。
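The abstract does not say how pass@k is estimated. A common choice in code-generation evaluation is the unbiased estimator computed from n samples per problem, of which c are correct. The sketch below assumes that estimator; the function name and the example numbers are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample at all.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 3 correct answers among 10 samples.
print(pass_at_k(n=10, c=3, k=1))  # single-sample success rate
print(pass_at_k(n=10, c=3, k=8))  # any 8 samples must include a correct one
```

Under this estimator, a problem that is solved only rarely still counts toward pass@k at large k, which is why large-k scores are read as a measure of coverage rather than of single-sample accuracy.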


Augmented Object Intelligence with XR-Objects

Abstract

arXiv:2404.13274v5 Announce Type: replace-cross Abstract: Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing. This paper explores Augmented Object Intelligence (AOI) in the context of XR, an interaction paradigm that aims to blur the lines between digital and physical by equipping real-world objects with the ability to interact as if they were digital, where every object has the potential to serve as a portal to digital functionalities. Our approach utilizes real-time object segmentation and classification, combined with the power of Multimodal Large Language Models (MLLMs), to facilitate these interactions without the need for object pre-registration. We implement the AOI concept in the form of XR-Objects, an open-source prototype system that provides a platform for users to engage with their physical environment in contextually relevant ways using object-based context menus. This system enables analog objects to not only convey information but also to initiate digital actions, such as querying for details or executing tasks. Our contributions are threefold: (1) we define the AOI concept and detail its advantages over traditional AI assistants, (2) detail the XR-Objects system's open-source design and implementation, and (3) show its versatility through various use cases and a user study.

摘要

将物理对象无缝集成为交互式数字实体仍然是空间计算领域的一个挑战。本文探讨了扩展现实(XR)环境中的增强对象智能(AOI)——一种通过赋予真实世界对象类似数字实体的交互能力来模糊数字与物理界限的交互范式,使每个对象都可能成为通往数字功能的门户。我们的方法结合实时对象分割分类与多模态大语言模型(MLLM)的能力,无需对象预注册即可实现这些交互。我们将AOI概念实现为XR-Objects开源原型系统,该平台允许用户通过基于对象的上下文菜单以情境相关的方式与物理环境互动。该系统使模拟对象不仅能传递信息,还能触发数字操作(如查询详情或执行任务)。我们的贡献包括:(1)定义AOI概念并阐述其相对于传统AI助手的优势;(2)详述XR-Objects系统的开源设计与实现;(3)通过多种应用案例和用户研究展示其多功能性。


COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models

Abstract

arXiv:2402.14889v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices. Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets. These benchmarks measure bias by observing an LLM's behavior on biased statements. However, these statements lack contextual considerations of the situations they try to present. To address this, we introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear. We develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to measure a biased statement's reliability in detecting bias, based on the variance in model behavior across different contexts. To evaluate the metric, we augmented 2,291 stereotyped statements from two existing benchmark datasets by adding contextual information. We show that COBIAS aligns with human judgment on the contextual reliability of biased statements (Spearman's ρ = 0.65, p = 3.4 × 10^-60) and can be used to create reliable benchmarks, which would assist bias mitigation works.

摘要

大型语言模型(LLMs)常从训练所用的网络数据中继承包含刻板印象与偏见的偏差。当前评估和缓解这些偏差的方法依赖于偏差基准数据集,这些基准通过观察LLM在偏见语句上的行为来测量偏差。然而,这些语句缺乏对其试图呈现情境的上下文考量。为此,我们提出了一种上下文可靠性框架,通过考虑偏见语句可能出现的不同情境来评估模型对偏见语句的鲁棒性。我们开发了"情境导向偏差指示与评估分数"(COBIAS),该指标基于模型在不同上下文中的行为差异,衡量偏见语句在检测偏差时的可靠性。为验证该指标,我们通过添加上下文信息对两个现有基准数据集中的2,291条刻板印象语句进行了扩展。实验表明,COBIAS与人类对偏见语句上下文可靠性的判断具有一致性(Spearman's ρ = 0.65, p = 3.4 × 10^-60),并可用于构建可靠的基准,从而辅助偏差缓解工作。
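The agreement with human judgment is reported as Spearman's ρ, which is the Pearson correlation of the two rank vectors. A minimal stdlib-only sketch follows. The function names and the toy inputs are assumptions for illustration; the paper's actual scoring pipeline is not described in the abstract.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the rank vectors of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical metric scores versus hypothetical human ratings for four statements.
print(spearman_rho([0.2, 0.5, 0.7, 0.9], [1, 3, 2, 4]))
```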


Strategic Collusion of LLM Agents: Market Division in Multi-Commodity Competitions

Abstract

arXiv:2410.00031v2 Announce Type: replace-cross Abstract: Machine-learning technologies are seeing increased deployment in real-world market scenarios. In this work, we explore the strategic behaviors of large language models (LLMs) when deployed as autonomous agents in multi-commodity markets, specifically within Cournot competition frameworks. We examine whether LLMs can independently engage in anti-competitive practices such as collusion or, more specifically, market division. Our findings demonstrate that LLMs can effectively monopolize specific commodities by dynamically adjusting their pricing and resource allocation strategies, thereby maximizing profitability without direct human input or explicit collusion commands. These results pose unique challenges and opportunities for businesses looking to integrate AI into strategic roles and for regulatory bodies tasked with maintaining fair and competitive markets. The study provides a foundation for further exploration into the ramifications of deferring high-stakes decisions to LLM-based agents.

摘要

机器学习技术在实际市场场景中的应用日益广泛。本研究探讨了大型语言模型(LLMs)作为自主代理在多商品市场(特别是古诺竞争框架内)中的策略行为。我们检验了LLMs是否能够独立从事反竞争行为,如合谋或更具体的市场分割。研究结果表明,LLMs能够通过动态调整定价和资源分配策略,有效垄断特定商品,从而在无需人类直接输入或明确合谋指令的情况下实现利润最大化。这些结果为寻求将人工智能整合到战略角色中的企业,以及负责维护公平竞争市场的监管机构,带来了独特的挑战和机遇。本研究为进一步探索将高风险决策委托给基于LLM的代理所产生的后果奠定了基础。
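To see why market division can be attractive in a Cournot setting, consider a textbook toy model. It is an assumption for illustration, not the paper's experimental setup: two symmetric firms, two independent commodities with identical linear inverse demand P = a - b*Q, and constant marginal cost c. The sketch compares each firm's profit when both firms compete in both markets with its profit when each firm serves one market alone.

```python
def cournot_duopoly(a, b, c):
    """Per-firm quantity, price, and profit at the symmetric Cournot-Nash equilibrium."""
    if not (a > c >= 0 and b > 0):
        raise ValueError("require a > c >= 0 and b > 0")
    q = (a - c) / (3 * b)          # each firm's best response to the other's q
    price = a - b * (2 * q)
    return q, price, (price - c) * q


def monopoly(a, b, c):
    """Quantity, price, and profit for a single firm serving the whole market."""
    if not (a > c >= 0 and b > 0):
        raise ValueError("require a > c >= 0 and b > 0")
    q = (a - c) / (2 * b)
    price = a - b * q
    return q, price, (price - c) * q


# Hypothetical parameters, identical for both commodities.
a, b, c = 100.0, 1.0, 10.0
compete_everywhere = 2 * cournot_duopoly(a, b, c)[2]  # duopoly profit in two markets
divide_markets = monopoly(a, b, c)[2]                 # monopoly profit in one market
print(compete_everywhere, divide_markets)
```

In this toy model, dividing the markets yields a higher per-firm profit than competing in both, which is the incentive that makes tacit market division worth testing for.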


Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation

Abstract

arXiv:2405.00715v5 Announce Type: replace-cross Abstract: Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic scored higher (4.2/5) in real-world readiness than physician-authored notes (4.1/5). We highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format, rather than relying on LLMs to determine this for clinical practice.

摘要

GPT-4和Gemini等专有大语言模型(LLM)在临床文本摘要任务中展现出良好的性能。然而,由于患者数据隐私和计算成本的考虑,许多医疗机构更倾向于使用本地部署的小型模型而非外部通用LLM。本研究为开源的LLaMA-2 130亿参数模型提出了一套全面的领域与任务特异性适配流程,使其能够基于门诊医患对话生成高质量的临床记录。该流程整合了持续预训练、监督微调以及基于AI与人类反馈的强化学习。我们提出了一种新方法DistillDirect,通过以Gemini 1.0 Pro作为教师模型进行同策略强化学习。最终获得的LLaMA-Clinic模型生成的临床记录质量可媲美医师撰写的记录。在一项盲法医师评阅研究中,90.4%的单项评估认为LLaMA-Clinic生成的记录在所有三项标准(实际应用准备度、完整性和准确性)上达到"可接受"或更高水平。在更具挑战性的"评估与计划"部分,LLaMA-Clinic在实际应用准备度评分(4.2/5)甚至高于医师撰写记录(4.1/5)。我们强调了未来临床记录生成任务的关键考量,指出预先定义最佳实践记录格式的重要性,而非依赖LLM为临床实践自行决定格式。


What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Abstract

arXiv:2412.12632v2 Announce Type: replace-cross Abstract: Incorporating external knowledge into large language models (LLMs) has emerged as a promising approach to mitigate outdated knowledge and hallucination in LLMs. However, external knowledge is often imperfect. In addition to useful knowledge, it often contains irrelevant information or misinformation that can impair the reliability of LLM responses. This paper focuses on LLMs' preferred external knowledge in imperfect contexts when handling multi-hop QA. Inspired by criminal procedural law's Chain of Evidence (CoE), we posit that knowledge preferred by LLMs should be both relevant to the question and mutually supportive across knowledge pieces. Accordingly, we propose an automated CoE discrimination approach and evaluate LLMs' effectiveness, faithfulness and robustness with CoE, including its application in Retrieval-Augmented Generation (RAG). Tests on five LLMs show that CoE improves generation accuracy, answer faithfulness, and robustness to knowledge conflicts, and boosts the performance of existing approaches in three practical RAG scenarios.

摘要

将外部知识融入大型语言模型(LLMs)已成为解决模型知识过时和幻觉问题的有效途径。然而外部知识往往存在缺陷——除有效信息外,还包含大量与上下文无关或错误的干扰信息,这些都可能损害LLM输出的可靠性。本文聚焦于LLMs在处理多跳问答时对不完美语境中外部知识的偏好机制。受刑事诉讼法'证据链'(CoE)概念启发,我们提出LLMs偏好的知识应同时满足问题相关性与知识片段间的相互支撑性。基于此,我们开发了自动化CoE判别方法,系统评估了LLMs在CoE框架下的有效性、忠实性和鲁棒性,包括其在检索增强生成(RAG)中的应用。五项LLM测试表明:CoE能显著提升生成准确性、答案忠实性、知识冲突下的鲁棒性,并在三种实际RAG场景中有效提升现有方法的性能。


HAFLQ: Heterogeneous Adaptive Federated LoRA Fine-tuned LLM with Quantization

Abstract

arXiv:2411.06581v2 Announce Type: replace-cross Abstract: Federated fine-tuning of pre-trained Large Language Models (LLMs) enables task-specific adaptation across diverse datasets while preserving privacy. However, challenges such as high computational and memory demands, heterogeneous client resources, bandwidth constraints, and ineffective global aggregation hinder its efficiency. To address these issues, we propose HAFLQ (Heterogeneous Adaptive Federated Low-Rank Adaptation Fine-tuned LLM with Quantization), a novel framework for efficient and scalable federated fine-tuning of LLMs in heterogeneous environments. To reduce memory and computation demands, we propose a salience-driven adaptive LLM quantization framework that evaluates the importance of transformer blocks using a salience metric and applies adaptive block-wise quantization accordingly. To handle heterogeneous computational capabilities, we propose an importance-based parameter truncation and freezing scheme. To address communication bottlenecks, we propose an importance-aware bandwidth-adaptive quantization method, which dynamically adjusts parameter precision based on importance and bandwidth constraints. To improve global model aggregation, we propose an adaptive rank-1 matrix-level aggregation strategy, which prevents information dilution and accelerates convergence by aggregating only updated rank-1 matrices from clients. Experimental results on the text classification task demonstrate that HAFLQ reduces memory usage by 31%, lowers communication cost by 49%, improves accuracy by 50%, and achieves faster convergence compared to the baseline method.

摘要

基于预训练大语言模型(LLMs)的联邦微调能够在保护隐私的同时,针对不同数据集实现任务特异性适配。然而,高计算与内存需求、异构客户端资源、带宽限制及低效全局聚合等挑战制约了其效率。为解决这些问题,我们提出HAFLQ(异构自适应联邦低秩量化微调大语言模型框架),这是一种面向异构环境的高效可扩展联邦微调新框架。为降低内存与计算需求,我们提出显著性驱动的自适应LLM量化框架,通过显著性指标评估Transformer模块重要性并实施自适应分块量化。针对异构计算能力,我们设计基于重要性的参数截断与冻结方案。为缓解通信瓶颈,提出重要性感知的带宽自适应量化方法,根据重要性及带宽限制动态调整参数精度。为优化全局模型聚合,提出自适应秩-1矩阵级聚合策略,仅聚合客户端更新的秩-1矩阵以避免信息稀释并加速收敛。文本分类任务的实验结果表明,相较于基线方法,HAFLQ可降低31%内存占用、减少49%通信开销、提升50%准确率,并实现更快收敛速度。


XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation

Abstract

arXiv:2412.15529v3 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent data with the generative capabilities of Large Language Models (LLMs), ensuring that the generated output is not only contextually relevant but also accurate and current. We introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the performance of foundational components of advanced RAG modules. These components are systematically categorized into four core phases: pre-retrieval, retrieval, post-retrieval, and generation. We systematically analyse them across reconfigured datasets, providing a comprehensive benchmark for their effectiveness. As the complexity of RAG systems continues to escalate, we underscore the critical need to identify potential failure points in RAG systems. We formulate a suite of experimental methodologies and diagnostic testing protocols to dissect the failure points inherent in RAG engineering. Subsequently, we proffer bespoke solutions aimed at bolstering the overall performance of these modules. Our work thoroughly evaluates the performance of advanced core components in RAG systems, providing insights into optimizations for prevalent failure points.

摘要

检索增强生成(RAG)通过结合相关数据检索与大型语言模型(LLM)的生成能力,确保输出结果不仅具有上下文相关性,同时保持准确性和时效性。本文推出XRAG——一个开源模块化代码库,用于系统评估高级RAG模块基础组件的性能表现。这些组件被系统划分为四个核心阶段:检索前处理、检索过程、检索后处理以及生成阶段。我们在重构数据集上对其进行系统性分析,为其效能提供全面基准测试。随着RAG系统复杂度的持续提升,我们强调识别系统潜在故障点的关键需求,并构建了一套实验方法论与诊断测试协议来剖析RAG工程中的固有故障点。基于此,我们提出了针对性定制解决方案以增强模块整体性能。本研究对RAG系统中高级核心组件的性能进行了全面评估,为常见故障点的优化提供了深入见解。


MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

Abstract

arXiv:2410.14731v2 Announce Type: replace-cross Abstract: KV cache has become a de facto technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention. As the size of the model and data grows, the KV cache can quickly become a bottleneck within the system in both storage and memory transfer. To address this, prior studies usually focus on the first three axes of the cache tensors for compression. This paper supplements them, focusing on the feature dimension axis, by utilizing low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We begin by investigating the canonical orthogonal projection method for data compression through principal component analysis (PCA). We observe that PCA projection suffers significant performance degradation at low compression rates. To bridge the gap, we propose to directly tune the orthogonal projection matrices with a distillation objective using an elaborate Matryoshka training strategy. After training, we adaptively search for the optimal compression rates for various layers and heads given varying compression budgets. Compared to previous works, our method can easily embrace pre-trained LLMs and hold a smooth tradeoff between performance and compression rate. We empirically witness the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base.

摘要

KV缓存已成为大型语言模型(LLM)推理的事实标准技术,其通过引入形状为(层数、头数、序列长度、特征维度)的张量来缓存自注意力机制的历史信息。随着模型和数据规模的增大,KV缓存在存储和内存传输方面会迅速成为系统瓶颈。针对此问题,现有研究通常集中于对缓存张量的前三个维度进行压缩。本文作为补充,聚焦特征维度轴,利用低秩投影矩阵将缓存特征转换至降维空间。我们首先探究基于主成分分析(PCA)的经典正交投影数据压缩方法,发现PCA投影在低压缩率下存在显著性能下降问题。为弥补这一缺陷,我们提出通过精心设计的套娃式训练策略,以蒸馏目标直接微调正交投影矩阵。训练完成后,根据不同的压缩预算,自适应搜索各层和各头的最优压缩率。相较于先前工作,本方法可无缝适配预训练LLM,并在性能与压缩率之间实现平滑权衡。实验证明我们的训练流程具有极高数据效率,对于LLaMA2-7B-base和Mistral-7B-v0.3-base等主流LLM,在平均60%(极端场景下可达75%)的KV缓存压缩率下仍能保持90%以上的性能表现。
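For a sense of scale, the cache shape given in the abstract (layer number, head number, sequence length, feature dimension) can be turned into a back-of-the-envelope memory estimate. The sketch below rests on several assumptions that are not in the abstract: LLaMA2-7B's public configuration (32 layers, 32 attention heads, head dimension 128), 16-bit values, a 4096-token sequence, and a reading of "compression rate of 60%" as the average fraction of the cache removed.

```python
def kv_cache_bytes(layers, heads, seq_len, head_dim, bytes_per_value=2):
    """Uncompressed KV cache size for a single sequence."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * layers * heads * seq_len * head_dim * bytes_per_value


def compressed_bytes(full_bytes, compression_rate):
    """Size left after removing `compression_rate` of the cache on average."""
    if not 0.0 <= compression_rate < 1.0:
        raise ValueError("compression_rate must be in [0, 1)")
    return full_bytes * (1.0 - compression_rate)


GIB = 2 ** 30
# Assumed LLaMA2-7B configuration, fp16 values, 4096-token context.
full = kv_cache_bytes(layers=32, heads=32, seq_len=4096, head_dim=128)
print(full / GIB)                          # uncompressed size in GiB
print(compressed_bytes(full, 0.60) / GIB)  # after an average 60% reduction
```

The paper searches for compression rates per layer and per head, so the uniform average used here gives only the total budget, not how it is distributed across the model.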


Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

Abstract

arXiv:2501.00745v2 Announce Type: replace-cross Abstract: The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.

摘要

基于大型语言模型(LLM)的搜索引擎日益普及,彻底改变了信息检索的格局。然而这类系统易受对抗攻击,尤其是排名操纵攻击——攻击者通过精心设计网页内容来操控LLM的排序结果,从而提升特定内容的排名,获取相对于竞争对手的不公平优势。本文研究了排名操纵攻击的动态机制,将该问题建模为无限重复囚徒困境博弈,其中多个参与者需策略性选择合作或攻击。我们分析了维持合作所需的条件,识别出影响参与者行为的关键因素,包括攻击成本、贴现率、攻击成功率以及触发策略。研究发现系统动态中存在临界点:当参与者具有前瞻性时,合作更可能持续。但从防御视角看,研究发现降低攻击成功概率在某些条件下反而会激励攻击行为,而限制攻击成功率上限的防御措施在特定场景中可能失效。这些发现揭示了LLM系统安全防护的复杂性。本研究为理解和缓解此类系统漏洞提供了理论基础与实践洞见,同时强调了自适应安全策略与审慎生态系统设计的重要性。
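The claim that cooperation is easier to sustain when players are forward-looking matches the textbook result for grim-trigger strategies in an infinitely repeated Prisoner's Dilemma. The sketch below uses the standard payoffs T > R > P (temptation, mutual cooperation, mutual defection) and a discount factor delta. It is a simplified illustration: the paper's model also includes attack costs and attack success rates, which are omitted here, and the payoff values are hypothetical.

```python
def grim_trigger_threshold(T, R, P):
    """Smallest discount factor at which grim trigger sustains cooperation."""
    if not T > R > P:
        raise ValueError("require T > R > P")
    return (T - R) / (T - P)


def cooperation_sustainable(T, R, P, delta):
    """Compare the discounted value of cooperating forever with deviating once."""
    if not T > R > P:
        raise ValueError("require T > R > P")
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must be in [0, 1)")
    cooperate = R / (1 - delta)            # R in every round
    deviate = T + delta * P / (1 - delta)  # T once, then P forever after
    return cooperate >= deviate


# Hypothetical payoffs: temptation 5, mutual cooperation 3, mutual defection 1.
print(grim_trigger_threshold(5, 3, 1))
print(cooperation_sustainable(5, 3, 1, delta=0.6))
print(cooperation_sustainable(5, 3, 1, delta=0.4))
```

A higher delta means future payoffs weigh more heavily, so a one-time gain from attacking is less likely to outweigh the permanent loss of cooperation.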


AD-LLM: Benchmarking Large Language Models for Anomaly Detection

Abstract

arXiv:2412.11142v3 Announce Type: replace-cross Abstract: Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs' pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.

摘要

异常检测(AD)是一项重要的机器学习任务,在现实世界中有诸多应用场景,包括欺诈检测、医疗诊断和工业监控等。在自然语言处理(NLP)领域,异常检测可帮助识别垃圾信息、虚假内容及异常用户行为等问题。尽管大语言模型(LLMs)在文本生成和摘要等任务中展现出强大影响力,但其在异常检测中的潜力尚未得到充分研究。本文提出AD-LLM——首个评估LLMs在NLP异常检测中应用效能的基准框架。我们重点研究三个核心任务:(i)零样本检测,利用LLMs的预训练知识实现无需任务特定训练的异常检测;(ii)数据增强,通过生成合成数据与类别描述来提升异常检测模型性能;(iii)模型选择,借助LLMs推荐无监督异常检测模型。通过多组数据集实验,我们发现LLMs在零样本异常检测中表现优异,精心设计的数据增强方法具有实用价值,但针对特定数据集的模型选择解释仍存在挑战。基于实验结果,我们提出了LLMs用于异常检测的六个未来研究方向。


Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms

Abstract

arXiv:2501.13977v2 Announce Type: replace-cross Abstract: Social media platforms utilize Machine Learning (ML) and Artificial Intelligence (AI) powered recommendation algorithms to maximize user engagement, which can result in inadvertent exposure to harmful content. Current moderation efforts, reliant on classifiers trained with extensive human-annotated data, struggle with scalability and adapting to new forms of harm. To address these challenges, we propose a novel re-ranking approach using Large Language Models (LLMs) in zero-shot and few-shot settings. Our method dynamically assesses and re-ranks content sequences, effectively mitigating harmful content exposure without requiring extensive labeled data. Alongside traditional ranking metrics, we also introduce two new metrics to evaluate the effectiveness of re-ranking in reducing exposure to harmful content. Through experiments on three datasets, three models and across three configurations, we demonstrate that our LLM-based approach significantly outperforms existing proprietary moderation approaches, offering a scalable and adaptable solution for harm mitigation.

摘要

社交媒体平台采用机器学习(ML)与人工智能(AI)驱动的推荐算法以最大化用户参与度,但可能导致用户无意接触有害内容。当前依赖海量人工标注数据训练分类器的审核机制,在可扩展性及应对新型危害方面存在局限。为解决这些问题,我们提出一种基于大语言模型(LLMs)的零样本和少样本场景下的新型重排序方法。该方案通过动态评估与内容序列重排序,无需大量标注数据即可有效降低有害内容曝光率。除传统排序指标外,我们还引入两项新指标以评估重排序策略在减少有害内容暴露方面的效能。通过在三个数据集、三种模型及三种配置下的实验验证,我们证明基于LLM的方法显著优于现有商业化审核方案,为危害缓解提供了可扩展且适应性强的解决方案。


Can We Trust AI Agents? A Case Study of an LLM-Based Multi-Agent System for Ethical AI

Abstract

arXiv:2411.08881v2 Announce Type: replace-cross Abstract: AI-based systems, including Large Language Models (LLMs), impact millions by supporting diverse tasks but face issues like misinformation, bias, and misuse. AI ethics is crucial as new technologies and concerns emerge, but objective, practical guidance remains debated. This study examines the use of LLMs for AI ethics in practice, assessing how LLM trustworthiness-enhancing techniques affect software development in this context. Using the Design Science Research (DSR) method, we identify techniques for LLM trustworthiness: multi-agents, distinct roles, structured communication, and multiple rounds of debate. We design a multi-agent prototype LLM-MAS, where agents engage in structured discussions on real-world AI ethics issues from the AI Incident Database. We evaluate the prototype across three case scenarios using thematic analysis, hierarchical clustering, comparative (baseline) studies, and running source code. The system generates approximately 2,000 lines of code per case, compared to only 80 lines in baseline trials. Discussions reveal terms like bias detection, transparency, accountability, user consent, GDPR compliance, fairness evaluation, and EU AI Act compliance, showing this prototype's ability to generate extensive source code and documentation addressing often overlooked AI ethics issues. However, practical challenges in source code integration and dependency management may limit its use by practitioners.

摘要

基于人工智能的系统(包括大语言模型)通过支持多样化任务影响着数百万人,但也面临错误信息、偏见和滥用等问题。随着新技术和伦理问题的涌现,AI伦理至关重要,但客观实用的指导方针仍存争议。本研究探讨实践中运用大语言模型解决AI伦理问题的可行性,评估增强LLM可信度的技术如何影响相关软件开发。采用设计科学研究方法,我们确立了四项提升LLM可信度的技术:多智能体架构、角色分工、结构化沟通和多重辩论机制。据此设计出多智能体原型系统LLM-MAS,其智能体针对AI事件数据库中的现实伦理问题开展结构化讨论。我们在三个案例场景中,通过主题分析、层次聚类、对比(基线)研究和运行源代码对原型进行评估。该系统每个案例生成约2000行代码,而基线试验仅生成80行。讨论内容涉及偏见检测、透明度、问责制、用户授权、GDPR合规性、公平性评估及欧盟AI法案合规等术语,表明该原型能生成大量源代码和文档以解决常被忽视的AI伦理问题。然而,源代码集成和依赖管理的实际挑战可能限制其实践应用。


Leveraging Large Language Models for Effective Label-free Node Classification in Text-Attributed Graphs

Abstract

arXiv:2412.11983v3 Announce Type: replace-cross Abstract: Graph neural networks (GNNs) have become the preferred models for node classification in graph data due to their robust capabilities in integrating graph structures and attributes. However, these models heavily depend on a substantial amount of high-quality labeled data for training, which is often costly to obtain. With the rise of large language models (LLMs), a promising approach is to utilize their exceptional zero-shot capabilities and extensive knowledge for node labeling. Despite encouraging results, this approach either requires numerous queries to LLMs or suffers from reduced performance due to noisy labels generated by LLMs. To address these challenges, we introduce Locle, an active self-training framework that does Label-free node Classification with LLMs cost-Effectively. Locle iteratively identifies small sets of "critical" samples using GNNs and extracts informative pseudo-labels for them with both LLMs and GNNs, serving as additional supervision signals to enhance model training. Specifically, Locle comprises three key components: (i) an effective active node selection strategy for initial annotations; (ii) a careful sample selection scheme to identify "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module that combines LLMs and GNNs with a rewired topology. Extensive experiments on five benchmark text-attributed graph datasets demonstrate that Locle significantly outperforms state-of-the-art methods under the same query budget to LLMs in terms of label-free node classification. Notably, on the DBLP dataset with 14.3k nodes, Locle achieves an 8.08% improvement in accuracy over the state-of-the-art at a cost of less than one cent. Our code is available at https://github.com/HKBU-LAGAS/Locle.

摘要

图神经网络(GNNs)因其在整合图结构与属性方面的强大能力,已成为图数据节点分类的首选模型。然而这些模型严重依赖大量高质量标注数据进行训练,而获取此类数据成本高昂。随着大语言模型(LLMs)的兴起,利用其卓越的零样本能力和丰富知识进行节点标注成为可行方案。尽管已有研究取得一定成果,但现有方法要么需要频繁查询LLMs,要么因LLMs生成的噪声标签导致性能下降。为解决这些问题,我们提出Locle——一种无需人工标注、经济高效的主动自训练框架。Locle通过迭代方式,利用GNNs识别少量"关键"样本,并综合LLMs和GNNs为其生成信息丰富的伪标签,作为增强模型训练的监督信号。具体而言,Locle包含三个核心组件:(i)用于初始标注的高效主动节点选择策略;(ii)基于标签不和谐度与熵的"关键"节点筛选机制;(iii)融合LLMs与GNNs、结合拓扑重连的标签优化模块。在五个文本属性图基准数据集上的实验表明,在相同LLMs查询预算下,Locle在无标注节点分类任务上显著优于现有最优方法。值得注意的是,在包含14.3k个节点的DBLP数据集上,Locle以不足1美分的成本实现了8.08%的准确率提升。代码已开源:https://github.com/HKBU-LAGAS/Locle。
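
One ingredient named in this abstract, selecting "critical" nodes by entropy, can be sketched as follows. This is a simplified illustration only: it ranks nodes by the Shannon entropy of their predicted class distribution and ignores label disharmonicity, whose definition the abstract does not give. The function names and the input format are assumptions, not Locle's API.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain_nodes(predictions, k):
    """Pick the k node ids whose predicted class distribution is most uncertain.

    predictions: dict mapping node id -> list of class probabilities
    (assumed to come from a GNN's softmax output).
    """
    ranked = sorted(predictions, key=lambda n: entropy(predictions[n]), reverse=True)
    return ranked[:k]


# Made-up predictions for three nodes over three classes.
preds = {
    "n1": [0.98, 0.01, 0.01],  # confident
    "n2": [0.34, 0.33, 0.33],  # nearly uniform
    "n3": [0.60, 0.30, 0.10],  # in between
}
critical = select_uncertain_nodes(preds, k=2)
```

Under a fixed LLM query budget, `k` would be the number of nodes sent to the LLM for pseudo-labeling in each round.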


FastVLM: Efficient Vision Encoding for Vision Language Models

Abstract

arXiv:2412.13303v2 Announce Type: replace-cross Abstract: Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVa-OneVision at the highest resolution (1152×1152), FastVLM achieves better performance on key benchmarks like SeedBench, MMMU and DocVQA, using the same 0.5B LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller. Code and models are available at https://github.com/apple/ml-fastvlm.

摘要

提升输入图像分辨率对于增强视觉语言模型(VLMs)的性能至关重要,尤其是在富含文本的图像理解任务中。然而,主流视觉编码器(如ViTs)在高分辨率下因大量视觉令牌和堆叠自注意力层导致的高编码延迟而效率低下。在不同操作分辨率下,VLM的视觉编码器可从两个维度进行优化:降低编码延迟并减少传递给大语言模型(LLM)的视觉令牌数量,从而降低整体延迟。基于对图像分辨率、视觉延迟、令牌数量与LLM规模之间相互作用的系统性效率分析,我们提出了FastVLM模型,该模型在延迟、模型规模和准确率之间实现了优化平衡。FastVLM采用新型混合视觉编码器FastViTHD,专为输出更少令牌并显著降低高分辨率图像编码时间而设计。与现有方法不同,FastVLM仅通过缩放输入图像即可实现视觉令牌数量与图像分辨率的最佳平衡,无需额外令牌剪枝,简化了模型设计。在LLaVA-1.5配置下,FastVLM在保持VLM基准测试性能相近的同时,实现了首令牌生成时间(TTFT)3.2倍的提升。与最高分辨率(1152×1152)下的LLaVa-OneVision相比,FastVLM使用相同的0.5B参数LLM,在SeedBench、MMMU和DocVQA等关键基准上表现更优,且TTFT提速85倍,视觉编码器体积缩小3.4倍。代码与模型已开源:https://github.com/apple/ml-fastvlm。
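
The claim that plain ViTs produce too many tokens at high resolution follows from simple arithmetic: a ViT that splits the image into non-overlapping square patches emits one token per patch, so the token count grows quadratically with the side length. The sketch below shows this for a generic ViT. The patch size of 14 is an illustrative assumption and is not FastViTHD's configuration, which the abstract does not specify.

```python
def vit_token_count(height, width, patch_size):
    """Number of patch tokens a plain ViT emits for an image (no class token)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch_size) * (width // patch_size)


# Doubling the side length quadruples the number of visual tokens.
tokens_336 = vit_token_count(336, 336, patch_size=14)  # 24 * 24
tokens_672 = vit_token_count(672, 672, patch_size=14)  # 48 * 48
```

Since self-attention cost also grows with the square of the token count, both encoder latency and the LLM prefill grow quickly with resolution, which is the trade-off the abstract targets.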


LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena

Abstract

arXiv:2501.03266v2 Announce Type: replace-cross Abstract: LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. In particular, little is known about how users respond when models refuse to answer a prompt, one of the primary mechanisms used to enforce ethical boundaries in LLMs. We address this gap by analyzing nearly 50,000 model comparisons from Chatbot Arena, a platform where users indicate their preferred LLM response in pairwise matchups, providing a large-scale setting for studying real-world user preferences. Using a novel RoBERTa-based refusal classifier fine-tuned on a hand-labeled dataset, we distinguish between refusals due to ethical concerns and technical limitations. Our results reveal a substantial refusal penalty: ethical refusals yield significantly lower win rates than both technical refusals and standard responses, indicating that users are especially dissatisfied when models decline a task for ethical reasons. However, this penalty is not uniform. Refusals receive more favorable evaluations when the underlying prompt is highly sensitive (e.g., involving illegal content), and when the refusal is phrased in a detailed and contextually aligned manner. These findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations, calling for more adaptive moderation strategies that account for context and presentation.

摘要

大型语言模型(LLM)的安全性与伦理对齐问题已被广泛讨论,但内容审核对用户满意度的影响仍缺乏深入探究。尤其当模型因伦理边界限制而拒绝回答提示时,用户的反应机制尚不明确。本研究通过分析聊天机器人竞技场(Chatbot Arena)中近50,000组模型对比数据填补了这一空白。该平台要求用户在成对较量中选择偏好的LLM响应,为研究真实场景下的用户偏好提供了大规模实验环境。我们基于手工标注数据集微调的新型RoBERTa拒绝分类器,能有效区分伦理拒绝与技术限制拒绝。研究结果揭示了显著的"拒绝惩罚"现象:伦理拒绝的胜率显著低于技术拒绝和标准回答,表明用户对模型基于伦理理由拒绝任务时尤为不满。然而这种惩罚并非均匀分布——当原始提示涉及高度敏感内容(如非法活动),或拒绝表述方式详尽且与上下文契合时,用户对拒绝的评价会相对积极。这些发现凸显了LLM设计中的核心矛盾:安全对齐行为可能与用户预期产生冲突,亟需开发能兼顾语境与表达形式的自适应内容审核策略。
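
The "refusal penalty" is stated in terms of win rates per response category. A minimal sketch of that computation is shown below, assuming each pairwise battle has already been reduced to a (category, won) pair for one response; ties are ignored. The category labels and the toy records are made up for illustration and are not the paper's data or classifier output.

```python
from collections import defaultdict


def win_rates(battles):
    """Compute the win rate per response category.

    battles: iterable of (category, won) pairs, where `won` is True if the
    response in that category was preferred by the user.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for category, won in battles:
        totals[category] += 1
        wins[category] += int(won)
    return {category: wins[category] / totals[category] for category in totals}


# Made-up records, only to show the shape of the computation.
toy_battles = [
    ("ethical_refusal", False),
    ("ethical_refusal", False),
    ("ethical_refusal", True),
    ("ethical_refusal", False),
    ("standard", True),
    ("standard", False),
]
rates = win_rates(toy_battles)
```

A lower rate for one category relative to another is the kind of gap the abstract calls a penalty; a real analysis would also control for prompt sensitivity and the opposing model.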


On the Feasibility of Using LLMs to Autonomously Execute Multi-host Network Attacks

Abstract

arXiv:2501.16466v3 Announce Type: replace-cross Abstract: LLMs have shown preliminary promise in some security tasks and CTF challenges. Real cyberattacks are often multi-host network attacks, which involve executing a number of steps across multiple hosts such as conducting reconnaissance, exploiting vulnerabilities, and using compromised hosts to exfiltrate data. To date, the extent to which LLMs can autonomously execute multi-host network attacks is not well understood. To this end, our first contribution is MHBench, an open-source multi-host attack benchmark with 10 realistic emulated networks (from 25 to 50 hosts). We find that popular LLMs including modern reasoning models (e.g., GPT4o, Gemini 2.5 Pro, Sonnet 3.7 Thinking) with state-of-the-art security-relevant prompting strategies (e.g., PentestGPT, CyberSecEval3) cannot autonomously execute multi-host network attacks. To enable LLMs to autonomously execute such attacks, our second contribution is Incalmo, a high-level abstraction layer. Incalmo enables LLMs to specify high-level actions (e.g., infect a host, scan a network). Incalmo's translation layer converts these actions into lower-level primitives (e.g., commands to exploit tools) through expert agents. In 9 out of 10 networks in MHBench, LLMs using Incalmo achieve at least some of the attack goals. Even smaller LLMs (e.g., Haiku 3.5, Gemini 2 Flash) equipped with Incalmo achieve all goals in 5 of 10 environments. We also validate the key role of high-level actions in Incalmo's abstraction in enabling LLMs to autonomously execute such attacks.

摘要

大型语言模型(LLMs)在某些安全任务和CTF挑战中已展现出初步潜力。真实的网络攻击通常是多主机协同攻击,涉及在多个主机上执行一系列步骤,例如实施侦察、利用漏洞以及通过受控主机进行数据渗出。迄今为止,LLMs能否自主执行多主机网络攻击尚未得到充分研究。为此,我们的第一个贡献是MHBench——一个包含10个真实模拟网络(规模从25至50台主机不等)的开源多主机攻击基准测试平台。研究发现,即使采用最先进的网络安全提示策略(如PentestGPT、CyberSecEval3),包括现代推理模型(如GPT4o、Gemini 2.5 Pro、Sonnet 3.7 Thinking)在内的主流LLMs仍无法自主执行多主机网络攻击。为使LLMs具备此类攻击的自主执行能力,我们的第二个贡献是Incalmo高层抽象层。该框架允许LLMs指定高级动作(例如感染主机、扫描网络),其翻译层通过专家代理将这些动作转换为底层操作指令(如漏洞利用工具命令)。在MHBench的10个测试网络中,使用Incalmo的LLMs在9个网络中实现了至少部分攻击目标。即使较小规模的LLMs(如Haiku 3.5、Gemini 2 Flash)配备Incalmo后,也能在10个测试环境中的5个实现全部目标。我们还验证了Incalmo抽象层中高级动作对LLMs自主执行此类攻击的关键作用。


Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities

Abstract

arXiv:2501.02406v4 Announce Type: replace-cross Abstract: Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not? We model LLM-generated text as a sequential stochastic process with complete dependence on history. We then design zero-shot statistical tests to (i) distinguish between text generated by two different known sets of LLMs A (non-sanctioned) and B (in-house), and (ii) identify whether text was generated by a known LLM or generated by any unknown model, e.g., a human or some other language generation process. We prove that the type I and type II errors of our test decrease exponentially with the length of the text. For that, we show that if B generates the text, then except with an exponentially small probability in string length, the log-perplexity of the string under A converges to the average cross-entropy of B and A. We then present experiments using LLMs with white-box access to support our theoretical results and empirically examine the robustness of our results to black-box settings and adversarial attacks. In the black-box setting, our method achieves an average TPR of 82.5% at a fixed FPR of 5%. Under adversarial perturbations, our minimum TPR is 48.6% at the same FPR threshold. Both results outperform all non-commercial baselines. See https://github.com/TaraRadvand74/llm-text-detection for code, data, and an online demo of the project.

摘要

验证内容来源的真实性对许多组织的运作至关重要,例如教育机构、社交媒体平台、企业等。随着大语言模型(LLM)生成的文本与人类创作内容几乎难以区分,这一问题正变得日益严峻。此外,许多机构使用内部专用LLM,并需确保外部未经授权的LLM不会在机构内部生成内容。本文旨在解决以下问题:给定一段文本,能否判定其是否由特定LLM生成?我们将LLM生成的文本建模为完全依赖历史信息的序列随机过程,进而设计零样本统计检验方法以实现:(1)区分由两个已知LLM集合(非授权集合A与内部集合B)生成的文本;(2)识别文本是否由已知LLM生成,或源自未知模型(如人类或其他语言生成过程)。理论证明表明,该检验的I类与II类错误率随文本长度呈指数级下降。通过论证当B生成文本时,除指数级小概率事件外,文本在A下的对数困惑度会收敛至B与A的平均交叉熵,验证了该结论。实验部分采用白盒访问的LLM验证理论结果,并实证检验黑盒场景与对抗攻击下的方法鲁棒性。黑盒场景中,本方法在5%固定假阳性率下平均真阳性率达82.5%;对抗扰动下,相同假阳性阈值的最小真阳性率为48.6%,两项结果均优于所有非商业基线方法。项目代码、数据及在线演示详见https://github.com/TaraRadvand74/llm-text-detection。
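
The key statement in this abstract, that the log-perplexity of a string under A concentrates around the cross-entropy of B and A when B generated it, can be illustrated with a toy sketch. It deliberately replaces the paper's history-dependent process with two memoryless (unigram) models, and the decision rule (pick whichever reference value is closer) is a simplification for illustration, not the authors' test. All names and probabilities are assumptions.

```python
import math


def log_perplexity(tokens, model):
    """Average negative log-probability of the tokens under a unigram model."""
    return -sum(math.log(model[t]) for t in tokens) / len(tokens)


def cross_entropy(source, scorer):
    """Expected negative log-probability under `scorer` of tokens drawn from `source`."""
    return -sum(p * math.log(scorer[t]) for t, p in source.items())


def closer_source(tokens, model_a, model_b):
    """Guess which model produced the tokens, scoring everything under model A."""
    lp = log_perplexity(tokens, model_a)
    h_a = cross_entropy(model_a, model_a)   # expected value if A generated the text
    h_ba = cross_entropy(model_b, model_a)  # expected value if B generated the text
    return "A" if abs(lp - h_a) <= abs(lp - h_ba) else "B"


# Toy unigram models over a two-token vocabulary (made-up probabilities).
model_a = {"x": 0.8, "y": 0.2}
model_b = {"x": 0.2, "y": 0.8}
text_like_a = ["x"] * 8 + ["y"] * 2  # token frequencies match model A
text_like_b = ["x"] * 2 + ["y"] * 8  # token frequencies match model B
```

For longer strings the empirical log-perplexity deviates less from these reference values, which is the intuition behind error rates that shrink with text length.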


Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

Abstract

arXiv:2502.06876v3 Announce Type: replace-cross Abstract: Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI. Existing methods like data mixture strategies face limitations, including heavy reliance on expert knowledge and conflicting optimization signals. While model merging offers parameter-level conflict-resolution strategies through integrating specialized models' parameters, its potential for 3H optimization remains underexplored. This paper systematically compares the effectiveness of model merging and data mixture methods in constructing 3H-aligned LLMs for the first time, revealing previously overlooked collaborative and conflict relationships among the 3H dimensions and discussing the advantages and drawbacks of data mixture (data-level) and model merging (parameter-level) methods in mitigating the conflict for balanced 3H optimization. Specifically, we propose a novel Reweighting Enhanced task Singular Merging method, RESM, which uses outlier weighting and sparsity-aware rank selection strategies to address the challenges of preference noise accumulation and layer sparsity adaptation inherent in 3H-aligned LLM merging. Extensive evaluations verify the effectiveness and robustness of RESM compared to previous data mixture (2%-5% gain) and model merging (1%-3% gain) methods in achieving balanced LLM alignment. We release our models through 3H_Merging (https://huggingface.co/Jinluan) for further investigations.

摘要

实现大型语言模型(LLMs)在帮助性、诚实性和无害性(3H优化)上的平衡对齐,是负责任人工智能的基石。现有方法如数据混合策略存在局限性,包括对专家知识的严重依赖和优化信号的相互冲突。虽然模型融合通过整合专用模型的参数提供了参数级的冲突解决策略,但其在3H优化中的潜力仍未得到充分探索。本文首次系统比较了模型融合与数据混合方法在构建3H对齐LLMs中的有效性,揭示了3H维度间先前被忽视的协作与冲突关系,并讨论了数据混合(数据层面)和模型融合(参数层面)方法在缓解冲突以实现平衡3H优化中的优缺点。特别地,我们提出了一种新颖的重加权增强任务奇异融合方法(RESM),通过离群值加权和稀疏感知秩选择策略,解决3H对齐LLM融合中固有的偏好噪声累积和层级稀疏适应挑战。大量评估验证了RESM相较于先前数据混合(2%-5%提升)和模型融合(1%-3%提升)方法在实现平衡LLM对齐方面的有效性和鲁棒性。我们通过3H_Merging(https://huggingface.co/Jinluan)发布模型以供进一步研究。


Hallucination, Monofacts, and Miscalibration: An Empirical Investigation

Abstract

arXiv:2502.08666v2 Announce Type: replace-cross Abstract: Hallucinated facts in large language models (LLMs) have recently been shown to obey a statistical lower bound determined by the monofact rate (related to the classical Good-Turing missing mass estimator) minus model miscalibration (Kalai & Vempala, 2024). We present the first empirical investigation of this three-way relationship in classical n-gram models and fine-tuned encoder-decoder Transformers. By generating training data from Pareto distributions with varying shape parameters, we systematically control the monofact rate and establish its positive relationship with hallucination. To bridge theory and practice, we derive an empirical analog of the hallucination bound by replacing the population miscalibration term (Section 2.1) with an empirical bin-wise KL divergence and confirm its practical viability. We then introduce selective upweighting -- a simple yet effective technique that strategically repeats as little as 5% of training examples -- to deliberately inject miscalibration into the model. This intervention reduces hallucination by up to 40%, challenging universal deduplication policies. Our experiments reveal a critical trade-off: selective upweighting maintains pre-injection levels of accuracy while substantially reducing hallucination, whereas standard training gradually improves accuracy but fails to address persistently high hallucination, indicating an inherent tension in optimization objectives.

摘要

近期研究表明,大语言模型(LLMs)中的幻觉事实遵循由单事实率(与经典Good-Turing缺失质量估计量相关)减去模型误校准所决定的统计下界(Kalai & Vempala, 2024)。本文首次在经典n-gram模型和微调编码器-解码器Transformer中实证研究了这种三元关系。通过从具有不同形状参数的帕累托分布生成训练数据,我们系统控制了单事实率,并证实其与幻觉呈正相关。为连接理论与实际,我们通过用基于分箱的KL散度替代总体误校准项(第2.1节),推导出幻觉界限的实证模拟形式,验证了其实际可行性。随后提出选择性加权——一种仅需策略性重复5%训练样本的简单有效技术——有意向模型注入误校准。该干预使幻觉降低达40%,对通用去重策略提出挑战。实验揭示关键权衡:选择性加权在保持注入前准确率水平的同时显著降低幻觉,而标准训练虽逐步提升准确率却无法解决持续高幻觉问题,表明优化目标间存在固有张力。
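
The monofact rate mentioned in this abstract is tied to the Good-Turing missing mass estimator, which estimates the probability of unseen items as the share of observations that occur exactly once. The sketch below computes that quantity for a list of observed facts. It assumes the monofact rate is defined this way (singletons divided by sample size); the paper's exact definition may differ in detail.

```python
from collections import Counter


def monofact_rate(observations):
    """Fraction of observations whose value appears exactly once in the sample.

    This is the Good-Turing estimate of missing mass: N1 / n, where N1 is the
    number of distinct values seen once and n is the sample size.
    """
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Toy sample: "b" and "d" each appear once among 7 observations.
sample = ["a", "a", "b", "c", "c", "c", "d"]
rate = monofact_rate(sample)
```

Repeating a singleton example in the training data removes it from the count of facts seen once, which is one way to read why duplicating a small share of examples changes this rate.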


Uncertainty Quantification for LLM-Based Survey Simulations

Abstract

arXiv:2502.17773v2 Announce Type: replace-cross Abstract: We investigate the use of large language models (LLMs) to simulate human responses to survey questions, and perform uncertainty quantification to gain reliable insights. Our approach converts imperfect LLM-simulated responses into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.

摘要

我们研究利用大语言模型(LLMs)模拟人类对调查问卷的响应,并通过不确定性量化获取可靠结论。该方法将不完美的LLM模拟响应转化为人类响应总体参数的置信集,以解决模拟群体与真实群体间的分布偏移问题。核心创新在于确定最优模拟响应数量:过多的模拟会导致置信集过窄而覆盖率不足,过少则会产生过度宽松的估计。为此,我们的方法自适应选择模拟样本量,确保有效的平均情况覆盖保证。该方法普遍适用于任何LLM(无论其保真度如何)及任何置信集构建流程。此外,所选样本量可量化LLM与目标人类群体之间的错配程度。我们在真实数据集和LLM上验证了该方法的有效性。


Call for Rigor in Reporting Quality of Instruction Tuning Data

Abstract

arXiv:2503.04807v3 Announce Type: replace-cross Abstract: Instruction tuning is crucial for adapting large language models (LLMs) to align with user intentions. Numerous studies emphasize the significance of the quality of instruction tuning (IT) data, revealing a strong correlation between IT data quality and the alignment performance of LLMs. In these studies, the quality of IT data is typically assessed by evaluating the performance of LLMs trained with that data. However, we identified a prevalent issue in such practice: hyperparameters for training models are often selected arbitrarily without adequate justification. We observed significant variations in hyperparameters applied across different studies, even when training the same model with the same data. In this study, we demonstrate the potential problems arising from this practice and emphasize the need for careful consideration in verifying data quality. Through our experiments on the quality of LIMA data and a selected set of 1,000 Alpaca data points, we demonstrate that arbitrary hyperparameter decisions can be used to support arbitrary conclusions.

摘要

指令调优对于调整大语言模型(LLMs)以适应用户意图至关重要。大量研究强调了指令调优(IT)数据质量的重要性,揭示了IT数据质量与LLMs对齐性能之间的强相关性。在这些研究中,IT数据的质量通常通过评估使用该数据训练的LLMs性能来衡量。然而,我们发现此类实践中存在一个普遍问题:模型训练的超参数往往未经充分论证而被任意选择。我们观察到,即使使用相同数据和相同模型进行训练,不同研究应用的超参数也存在显著差异。本研究揭示了这种实践可能引发的问题,并强调在验证数据质量时需要审慎考虑。通过对LIMA数据集质量和精选的1,000条Alpaca数据点的实验,我们证明任意超参数决策可能导致任意结论的产生。


Parameterized Synthetic Text Generation with SimpleStories

Abstract

arXiv:2504.09184v2 Announce Type: replace-cross Abstract: We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we move the frontier regarding the fewest-parameter language model that outputs grammatical natural language.

摘要

我们提出SimpleStories——一个大规模简化语言合成故事数据集,包含200万条英文和日文样本。通过多层级抽象的参数化提示,我们实现了对故事特征的大规模控制,从而诱导出句法和语义的多样性。在新训练模型套件上的消融实验表明,相较于TinyStories数据集,本数据集在样本效率和模型可解释性方面均有提升。我们开源了模型创建的所有组成部分,以期推动端到端训练过程的新研究方法。作为副产品,我们推进了能输出合乎语法自然语言的最小参数量语言模型的研究前沿。


DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

Abstract

arXiv:2504.11358v2 Announce Type: replace-cross Abstract: LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.

摘要

基于大语言模型(LLM)的应用程序和智能代理容易受到提示注入攻击,攻击者通过向输入中注入恶意提示以诱导模型输出符合攻击者意图的结果。现有检测方法旨在判断给定输入是否包含被注入的提示,但其对最先进攻击的检测效果有限,更难以应对自适应攻击。本研究提出DataSentinel——一种基于博弈论的提示注入攻击检测方法。该方法通过微调LLM来检测经过策略性调整以逃避检测的注入提示。我们将此问题建模为极小极大优化问题,其目标是通过微调LLM来检测强自适应攻击。此外,我们提出一种基于梯度的求解方法,通过交替处理内部极大化和外部极小化问题来解决该优化问题。在多个基准数据集和LLM上的评估结果表明,DataSentinel能有效检测现有及自适应的提示注入攻击。
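
The minimax formulation described in this abstract can be written generically as below. The notation is an assumption introduced here for illustration ($\theta$ for the detector's parameters, $\delta$ for the injected prompt chosen by the attacker from a feasible set $\Delta$, $\mathcal{L}$ for the detection loss, $\eta$ for step sizes); the paper's own objective and update rules may differ.

```latex
% Outer problem: fine-tune the detector; inner problem: the strongest evasive injection.
\min_{\theta} \; \max_{\delta \in \Delta} \; \mathcal{L}(\theta, \delta)

% Alternating gradient-based updates (generic sketch):
\delta_{t+1} = \delta_{t} + \eta_{\delta} \, \nabla_{\delta} \mathcal{L}(\theta_{t}, \delta_{t}),
\qquad
\theta_{t+1} = \theta_{t} - \eta_{\theta} \, \nabla_{\theta} \mathcal{L}(\theta_{t}, \delta_{t+1})
```

The inner step makes the injection harder to detect for the current detector, and the outer step then updates the detector against that stronger injection.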


Mixture of Routers

Abstract

arXiv:2503.23362v2 Announce Type: replace-cross Abstract: Supervised fine-tuning (SFT) is a milestone in aligning large language models with human instructions and adapting them to downstream tasks. In particular, Low-Rank Adaptation (LoRA) has gained widespread attention due to its parameter efficiency. However, its impact on improving the performance of large models remains limited. Recent studies suggest that combining LoRA with Mixture-of-Experts (MoE) can significantly enhance fine-tuning performance. MoE adapts to the diversity and complexity of datasets by dynamically selecting the most suitable experts, thereby improving task accuracy and efficiency. Despite impressive results, recent studies reveal issues in the MoE routing mechanism, such as incorrect assignments and imbalanced expert allocation. Inspired by the principles of Redundancy and Fault Tolerance Theory, we integrate the concept of Mixture of Experts into the routing mechanism and propose an efficient fine-tuning method called Mixture of Routers (MoR). It employs multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. The results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%. MoR can serve as a plug-and-play, parameter-efficient fine-tuning method suitable for a wide range of applications. Our code is available here: https://anonymous.4open.science/r/MoR-DFC6.

摘要

监督微调(SFT)是将大语言模型与人类指令对齐并适应下游任务的重要里程碑。其中,低秩自适应(LoRA)因其参数高效性获得广泛关注,但其对提升大模型性能的作用仍有限。近期研究表明,将LoRA与混合专家(MoE)相结合可显著增强微调性能——MoE通过动态选择最合适的专家来适应数据集的多样性和复杂性,从而提高任务准确性和效率。尽管取得显著效果,最新研究揭示了MoE路由机制存在专家分配错误、负载不均等问题。受冗余容错理论启发,我们创新性地将混合专家思想融入路由机制,提出名为混合路由器(MoR)的高效微调方法。该方法采用多个子路由器联合选择,并通过可学习的主路由器确定子路由器权重。实验结果表明,MoR在多数任务上优于基线模型,平均性能提升达1%。MoR可作为即插即用的参数高效微调方法,适用于广泛的应用场景。代码已开源:https://anonymous.4open.science/r/MoR-DFC6。
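
The routing scheme described here, several sub-routers voting on experts with a main router weighting the votes, can be sketched in a few lines. The abstract does not say how the sub-router outputs are combined, so the weighted average of softmax distributions below is an assumption, as are the function names and the use of fixed logits in place of learned layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_routers(sub_router_logits, main_router_logits, top_k=1):
    """Combine sub-router expert distributions using main-router weights.

    sub_router_logits: one list of expert logits per sub-router.
    main_router_logits: one logit per sub-router.
    Returns (combined expert scores, indices of the top_k experts).
    """
    weights = softmax(main_router_logits)
    dists = [softmax(logits) for logits in sub_router_logits]
    n_experts = len(dists[0])
    combined = [
        sum(w * dist[e] for w, dist in zip(weights, dists))
        for e in range(n_experts)
    ]
    chosen = sorted(range(n_experts), key=lambda e: combined[e], reverse=True)[:top_k]
    return combined, chosen


# Two sub-routers prefer expert 0, one prefers expert 1; the main router is neutral.
scores, experts = mixture_of_routers(
    sub_router_logits=[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0]],
    main_router_logits=[0.0, 0.0, 0.0],
)
```

Because a single mistaken sub-router is outvoted here, the example mirrors the redundancy argument in the abstract: one incorrect assignment does not decide the final expert.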


Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts

Abstract

arXiv:2503.16529v2 Announce Type: replace-cross Abstract: DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 achieves a 100% attack success rate when processing harmful prompts. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study utilizes the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for the entire DeepSeek-R1 model series. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe to serve as a valuable resource for future research and optimization of DeepSeek models.

摘要

DeepSeek-R1以其卓越的推理能力和开源策略闻名,正深刻影响着全球人工智能格局。然而该模型存在显著的安全缺陷。思科旗下Robust Intelligence与宾夕法尼亚大学联合研究发现,DeepSeek-R1在处理有害提示时攻击成功率高达100%。此外,多家安全公司与研究机构均发现该模型存在关键安全漏洞。尽管中国联通已发现R1在中文场景下的安全隐患,但R1系列其余蒸馏模型的安全能力尚未得到系统评估。为填补这一空白,本研究采用综合性中文安全基准CHiSafetyBench对DeepSeek-R1系列蒸馏模型展开深度安全评估,旨在衡量蒸馏前后模型在中文语境下的安全能力,并进一步阐明蒸馏对模型安全性的负面影响。基于评估发现,我们对整个DeepSeek-R1模型系列实施了针对性安全增强。实验结果表明,增强后的模型在保持推理能力无明显下降的同时,安全性获得显著提升。我们将安全增强模型开源发布于https://github.com/UnicomAI/DeepSeek-R1-Safe,为DeepSeek模型的后续研究与优化提供有益资源。


KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse

Abstract

arXiv:2503.16525v2 Announce Type: replace-cross Abstract: Recent advances in long-text understanding have pushed the context length of large language models (LLMs) up to one million tokens. It boosts LLMs' accuracy and reasoning capacity but causes exorbitant computational costs and unsatisfactory Time to First Token (TTFT). KV cache reuse, which reuses the exact same KV cache of prefixes and templates or shares similar ones but with extra selective recomputation, offers a promising way to tackle this issue. However, prior studies overlook the cross-request KV reuse and the attention deviations introduced by new tokens during the decoding stage. In this paper, we present a KV cache management module that shares the KV cache across requests under multi-tenant scenarios without sacrificing model accuracy. Our system, KVShare, enables accurate and efficient LLM serving by 1) a Dual-Stage High Deviation algorithm (DHD) that conditionally selects a small portion of KV cache to be recomputed during both prefill and decode phases, and 2) a cache-aware scheduler that prioritizes requests based on their KV cache hit rates and orchestrates continuous batching to achieve enhanced system efficiency and faster TTFT. Multi-task experiments conducted on models such as Qwen2.5-7B, Llama3.1-8B and Yi1.5-9B demonstrate that KVShare reduces TTFT by up to 9.39x and increases throughput by 1.2x compared to the full KV recompute. Moreover, KVShare achieves a 20.38% boost in accuracy compared to SOTA methods.

摘要

长文本理解领域的最新进展已将大语言模型(LLM)的上下文长度提升至百万token级别。这一突破虽增强了模型的准确性和推理能力,却带来了高昂的计算成本和欠佳的首token生成时间(TTFT)。KV缓存重用技术通过复用完全相同的前缀与模板KV缓存,或在共享相似缓存时辅以选择性重计算,为解决该问题提供了可行方案。然而,现有研究未能充分考虑跨请求的KV重用机制,以及解码阶段新token引入的注意力偏差。本文提出一种多租户场景下不损失模型精度的跨请求KV缓存管理模块KVShare,该系统通过以下创新实现精准高效的LLM服务:1)双阶段高偏差算法(DHD),在预填充和解码阶段有条件地选择少量KV缓存进行重计算;2)基于缓存命中率优先调度请求的缓存感知调度器,结合连续批处理技术提升系统效率并加速TTFT。在Qwen2.5-7B、Llama3.1-8B和Yi1.5-9B等模型上的多任务实验表明,相比全KV重计算方案,KVShare最高可降低9.39倍TTFT并提升1.2倍吞吐量,其准确率较现有最优方法提升20.38%。
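
The cache-aware scheduler in this abstract prioritizes requests by their KV cache hit rate. A minimal sketch of that ordering step follows. It assumes the hit rate is the share of a request's tokens covered by the longest matching cached prefix; KVShare also reuses similar, non-identical caches with selective recomputation, which this sketch does not model. Function names and data are hypothetical.

```python
def prefix_hit_rate(tokens, cached_prefixes):
    """Share of `tokens` covered by the longest exactly matching cached prefix."""
    if not tokens:
        return 0.0
    best = 0
    for prefix in cached_prefixes:
        matched = 0
        for a, b in zip(tokens, prefix):
            if a != b:
                break
            matched += 1
        best = max(best, matched)
    return best / len(tokens)


def cache_aware_order(requests, cached_prefixes):
    """Return request ids sorted by descending cache hit rate.

    requests: dict mapping request id -> list of prompt tokens.
    """
    return sorted(
        requests,
        key=lambda rid: prefix_hit_rate(requests[rid], cached_prefixes),
        reverse=True,
    )


# One cached system-prompt prefix and three pending requests (toy tokens).
cached = [["sys", "you", "are"]]
pending = {
    "r1": ["hi", "there"],
    "r2": ["sys", "you", "are", "helpful"],
    "r3": ["sys", "you", "x", "y"],
}
order = cache_aware_order(pending, cached)
```

Serving high-hit-rate requests first shortens their prefill, which is where the TTFT gain comes from; a production scheduler would also need to bound waiting time for low-hit-rate requests.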


Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?

Abstract

arXiv:2504.01698v3 Announce Type: replace-cross Abstract: Theory of Mind (ToM), the ability to attribute mental states to others, is fundamental for human social intelligence and a critical capability for advanced Artificial Intelligence. Recent advancements in Large Language Models (LLMs) have shown promising performance on ToM benchmarks, raising the question: Do these benchmarks necessitate explicit human-like reasoning processes, or can models succeed through alternative strategies? We investigate this question empirically by applying Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to LLMs of varying scales (0.5B to 7B parameters) and evaluating them across multiple ToM datasets. Our results reveal a scale-dependent impact of RL: while RL significantly improves accuracy and fosters high-quality, interpretable, and transferable belief-tracking reasoning in larger models (7B), it leads to "reasoning collapse" in smaller models (≤3B), where high accuracy and generalization ability are achieved via drastically shortened, less meaningful responses. Surprisingly, further SFT achieves competitive and generalizable performance across these benchmarks, often matching or exceeding RL models in accuracy, despite not being explicitly trained to produce structured reasoning traces. These findings highlight a critical discrepancy between benchmark accuracy and the nature of learned reasoning. Our work suggests that current ToM benchmarks may be solvable without requiring the explicit, human-like simulation of mental states they were designed to probe. LLMs, particularly when scale is limited or training signals focus solely on output correctness, may leverage alternative rules effective for benchmark data structures.

摘要

心理理论(ToM)作为理解他人心理状态的能力,是人类社会智能的基础,也是高级人工智能的关键能力。近期大型语言模型(LLMs)在ToM基准测试中展现出优异表现,这引发了一个核心问题:这些基准测试是否需要显式的人类推理过程,抑或模型可以通过替代策略获得成功?我们通过将强化学习(RL)和监督微调(SFT)应用于不同规模(0.5B至7B参数)的LLMs,并在多个ToM数据集上进行评估,对该问题展开实证研究。研究结果表明RL的影响具有规模依赖性:在较大模型(7B)中,RL显著提升准确率并促进高质量、可解释且可迁移的信念追踪推理;而在较小模型(≤3B)中则导致"推理崩溃"现象——模型通过大幅缩短、缺乏实质意义的响应即可实现高准确率和泛化能力。值得注意的是,尽管未经过显式结构化推理训练,SFT在这些基准测试中展现出与RL模型相当甚至更优的准确率和泛化性能。这些发现揭示了基准测试准确率与学习所得推理本质之间的关键差异。研究表明,当前ToM基准测试可能无需依赖设计时所设想的显式心理状态模拟即可被解决。当模型规模受限或训练信号仅关注输出正确性时,LLMs可能利用对基准数据结构有效的替代规则来实现性能提升。


DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain

Abstract

arXiv:2504.16116v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have achieved impressive performance in diverse natural language processing tasks, but specialized domains such as Web3 present new challenges and require more tailored evaluation. Despite the significant user base and capital flows in Web3, encompassing smart contracts, decentralized finance (DeFi), non-fungible tokens (NFTs), decentralized autonomous organizations (DAOs), on-chain governance, and novel token-economics, no comprehensive benchmark has systematically assessed LLM performance in this domain. To address this gap, we introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields: fundamental blockchain concepts, blockchain infrastructure, smart contract, DeFi mechanisms, DAOs, NFTs, token economics, meme concept, and security vulnerabilities. Beyond multiple-choice questions, DMind Benchmark features domain-specific tasks such as contract debugging and on-chain numeric reasoning, mirroring real-world scenarios. We evaluated 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen, uncovering notable performance gaps in specialized areas like token economics and security-critical contract analysis. While some models excel in blockchain infrastructure tasks, advanced subfields remain challenging. Our benchmark dataset and evaluation pipeline are open-sourced on https://huggingface.co/datasets/DMindAI/DMind_Benchmark, reaching number one in Hugging Face's trending dataset charts within a week of release.

摘要

大型语言模型(LLM)在多样化自然语言处理任务中展现出卓越性能,但Web3等专业领域带来了新挑战并需要更具针对性的评估。尽管Web3领域拥有庞大的用户群体和资金流动,涵盖智能合约、去中心化金融(DeFi)、非同质化代币(NFT)、去中心化自治组织(DAO)、链上治理和新型代币经济学等方向,目前仍缺乏系统评估LLM在该领域表现的综合性基准。为此,我们推出DMind Benchmark——一个覆盖九大关键子领域的全景式Web3评估体系,包括:基础区块链概念、区块链基础设施、智能合约、DeFi机制、DAO、NFT、代币经济学、模因概念及安全漏洞。除选择题外,该基准还包含合约调试和链上数值推理等贴合实际场景的专项任务。我们对26个模型(包括ChatGPT、Claude、DeepSeek、Gemini、Grok和Qwen等)进行了评估,发现在代币经济学和安全关键合约分析等专业领域存在显著性能差距。虽然部分模型在区块链基础设施任务中表现优异,但高级子领域仍具挑战性。本基准数据集与评估流程已开源发布于https://huggingface.co/datasets/DMindAI/DMind_Benchmark,发布一周内即登顶Hugging Face热门数据集榜单。